Big Project: Fantasy Premier League (FPL) Points Predictor

Data Rules Everything Around Me (DREAM) TEAM - Fall 2024 - CME 538

Feras Abdulla - Maha Fakhroo - Syed Shahid Hossaini - Eric Guan


Exploratory Data Analysis (EDA) and Visualizations


Below, the different columns available in the database are listed and explained:

  1. 'name': Name of player
  2. 'position': Position on the pitch (Goalkeeper, Defender, Midfielder, Forward)
  3. 'team': Premier League team with which the player is affiliated
  4. 'xP': Expected points for the player in the given fixture
  5. 'assists': Actual number of assists
  6. 'bonus': Actual number of bonus points awarded
  7. 'bps': Stands for 'Bonus Points System', a raw score based on performance metrics like goals, assists, clean sheets, saves, tackles, and other contributions that is used to rank players and determine 'bonus' scores.
  8. 'clean_sheets': Binary (0/1) column identifying whether the player earned points for a clean sheet (i.e., his team conceded zero goals while he was on the pitch).
  9. 'creativity': A measure of a player's potential to create scoring opportunities (passes, crosses, etc.).
  10. 'element': A unique ID for the player in the FPL system.
  11. 'fixture': The ID of the match the player participated in.
  12. 'goals_conceded': The number of goals the player's team conceded while they were on the pitch.
  13. 'goals_scored': The number of goals scored by the player.
  14. 'Influence_Creativity_Threat_Index': A combined metric summarizing the player's influence, creativity, and threat.
  15. 'influence': A measure of a player's impact on a match (defensive and offensive contributions).
  16. 'kickoff_time': The start time of the match.
  17. 'minutes': The number of minutes the player was on the pitch during the match.
  18. 'opponent_team': The ID of the opposing team in the fixture.
  19. 'own_goals': The number of own goals scored by the player.
  20. 'penalties_missed': The number of penalty kicks missed by the player.
  21. 'penalties_saved': The number of penalty kicks saved by the player (goalkeepers only).
  22. 'red_cards': The number of red cards received by the player.
  23. 'round': The fantasy round number of the match.
  24. 'saves': The number of saves made by the player (goalkeepers only).
  25. 'selected': The number of FPL managers who selected the player for their teams in this round.
  26. 'team_a_score': The number of goals scored by the away team in the match.
  27. 'team_h_score': The number of goals scored by the home team in the match.
  28. 'threat': A measure of a player's likelihood of scoring goals based on their attacking actions.
  29. 'total_points': The total FPL points earned by the player in the match. This column will be our label.
  30. 'transfers_balance': The net number of transfers for the player (transfers in minus transfers out).
  31. 'transfers_in': The number of FPL teams that transferred the player in before this match.
  32. 'transfers_out': The number of FPL teams that transferred the player out before this match.
  33. 'value': The player's price in FPL, stored in tenths of a million GBP (e.g., 55 = £5.5m).
  34. 'was_home': A boolean indicating if the player's team was playing at home (True/1 = home, False/0 = away).
  35. 'yellow_cards': The number of yellow cards received by the player.
  36. 'GW': The specific gameweek for the match.
  37. 'expected_goals': A metric predicting the likelihood of the player scoring based on their chances.
  38. 'expected_assists': A metric predicting the likelihood of the player assisting a goal.
  39. 'expected_goal_involvements': The sum of 'expected_goals' and 'expected_assists', representing the player's total expected goal contributions.
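
Two of the columns above are defined as combinations of other columns ('transfers_balance' and 'expected_goal_involvements'), which allows a quick consistency check. The sketch below is illustrative only: `sample` is a small hand-made frame with hypothetical transfer counts, not data loaded from 'master.csv', and the tolerance passed to `np.isclose` is an assumption to allow for rounding in the source.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; on the real data this would be the `master` dataframe.
sample = pd.DataFrame({
    'transfers_in': [120, 0, 45],
    'transfers_out': [20, 0, 90],
    'transfers_balance': [100, 0, -45],
    'expected_goals': [0.392763, 0.0, 0.0],
    'expected_assists': [0.0, 0.0, 0.205708],
    'expected_goal_involvements': [0.392763, 0.0, 0.205708],
})

# transfers_balance should equal transfers_in minus transfers_out
balance_ok = (sample['transfers_in'] - sample['transfers_out']) == sample['transfers_balance']

# expected_goal_involvements should equal expected_goals plus expected_assists
xgi_ok = np.isclose(
    sample['expected_goals'] + sample['expected_assists'],
    sample['expected_goal_involvements'],
    atol=1e-2,  # assumed tolerance for rounding in the source data
)

print(f"transfers_balance consistent: {balance_ok.all()}")
print(f"expected_goal_involvements consistent: {xgi_ok.all()}")
```

Rows that fail either check would be worth inspecting before modeling.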

Import Packages

We import the libraries used throughout the notebook: NumPy and pandas for data handling, seaborn and Matplotlib for visualization, and scikit-learn for preprocessing, train/test splitting, cross-validation, model fitting, and evaluation.

In [73]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import ast
import unicodedata

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

Import Data

Let's start by importing the 'master.csv' file into a dataframe.

In [74]:
master = pd.read_csv('../master.csv')
master.head()
Out[74]:
                  name  position           team   xP  assists  bonus  bps  clean_sheets  creativity  element  ...  transfers_balance  transfers_in  transfers_out  value  was_home  yellow_cards  GW  expected_goals  expected_assists  expected_goal_involvements
0       Aaron Connolly       FWD       Brighton  0.5        0      0   -3             0         0.3       78  ...                  0             0              0     55      True             0   1        0.392763          0.000000                    0.392763
1      Aaron Cresswell       DEF       West Ham  2.1        0      0   11             0        11.2      435  ...                  0             0              0     50      True             0   1        0.000000          0.000000                    0.000000
2           Aaron Mooy       MID       Brighton  0.0        0      0    0             0         0.0       60  ...                  0             0              0     50      True             0   1             NaN               NaN                         NaN
3       Aaron Ramsdale        GK  Sheffield Utd  2.5        0      0   12             0         0.0      483  ...                  0             0              0     50      True             0   1        0.000000          0.000000                    0.000000
4  Abdoulaye DoucourA©       MID        Everton  1.3        0      0   20             1        44.6      512  ...                  0             0              0     55     False             0   1        0.000000          0.205708                    0.205708

5 rows × 39 columns

In [75]:
master.shape
Out[75]:
(111920, 39)
In [76]:
master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111920 entries, 0 to 111919
Data columns (total 39 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   name                               111920 non-null  object 
 1   position                           111920 non-null  object 
 2   team                               111920 non-null  object 
 3   xP                                 111920 non-null  float64
 4   assists                            111920 non-null  int64  
 5   bonus                              111920 non-null  int64  
 6   bps                                111920 non-null  int64  
 7   clean_sheets                       111920 non-null  int64  
 8   creativity                         111920 non-null  float64
 9   element                            111920 non-null  int64  
 10  fixture                            111920 non-null  int64  
 11  goals_conceded                     111920 non-null  int64  
 12  goals_scored                       111920 non-null  int64  
 13  Influence_Creativity_Threat_Index  111920 non-null  float64
 14  influence                          111920 non-null  float64
 15  kickoff_time                       111920 non-null  object 
 16  minutes                            111920 non-null  int64  
 17  opponent_team                      111920 non-null  int64  
 18  own_goals                          111920 non-null  int64  
 19  penalties_missed                   111920 non-null  int64  
 20  penalties_saved                    111920 non-null  int64  
 21  red_cards                          111920 non-null  int64  
 22  round                              111920 non-null  int64  
 23  saves                              111920 non-null  int64  
 24  selected                           111920 non-null  int64  
 25  team_a_score                       111920 non-null  int64  
 26  team_h_score                       111920 non-null  int64  
 27  threat                             111920 non-null  float64
 28  total_points                       111920 non-null  int64  
 29  transfers_balance                  111920 non-null  int64  
 30  transfers_in                       111920 non-null  int64  
 31  transfers_out                      111920 non-null  int64  
 32  value                              111920 non-null  int64  
 33  was_home                           111920 non-null  bool   
 34  yellow_cards                       111920 non-null  int64  
 35  GW                                 111920 non-null  int64  
 36  expected_goals                     79272 non-null   float64
 37  expected_assists                   79272 non-null   float64
 38  expected_goal_involvements         79272 non-null   float64
dtypes: bool(1), float64(8), int64(26), object(4)
memory usage: 32.6+ MB

Data Exploration and Cleaning

Aligning Categorical Columns

Let's examine the columns that contain categorical data, including how many unique values each contains and what those unique values are.

In [77]:
# Define categorical columns
categorical_columns = ['position', 'team']

# Calculate how many unique values there are for each categorical column
unique_vals = master[categorical_columns].nunique()
print("Unique values in categorical features:")
print(unique_vals)

# Print different categorical values
for i in categorical_columns:
    print(f"\nUnique values in '{i}':")
    print(master[i].unique())
Unique values in categorical features:
position     5
team        27
dtype: int64

Unique values in 'position':
['FWD' 'DEF' 'MID' 'GK' 'GKP']

Unique values in 'team':
['Brighton' 'West Ham' 'Sheffield Utd' 'Everton' 'Fulham' 'Wolves' 'Leeds'
 'Leicester' 'Liverpool' 'West Brom' 'Arsenal' 'Southampton' 'Newcastle'
 'Chelsea' 'Crystal Palace' 'Spurs' 'Man Utd' 'Man City' 'Aston Villa'
 'Burnley' 'Watford' 'Norwich' 'Brentford' 'Bournemouth' "Nott'm Forest"
 'Luton' 'Ipswich']

As we can see, 'position' has 5 unique values but only 4 actual positions: 'FWD' = Forward, 'DEF' = Defender, 'MID' = Midfielder, and 'GK'/'GKP' = Goalkeeper. 'GK' and 'GKP' refer to the same position; the label differs because the naming convention changed between seasons, so we will align them to a single value: 'GK'. We will also replace the abbreviated name of Nottingham Forest in the 'team' column, "Nott'm Forest", with the full name: 'Nottingham Forest'.

In [78]:
master['position'] = master['position'].replace('GKP', 'GK') # GKP --> GK
master['team'] = master['team'].replace("Nott'm Forest", "Nottingham Forest") # Nott'm Forest --> Nottingham Forest

# Calculate how many unique values there are for each categorical column
unique_vals = master[categorical_columns].nunique()
print("Unique values in categorical features:")
print(unique_vals)

# Print different categorical values
for i in categorical_columns:
    print(f"\nUnique values in '{i}':")
    print(master[i].unique())
Unique values in categorical features:
position     4
team        27
dtype: int64

Unique values in 'position':
['FWD' 'DEF' 'MID' 'GK']

Unique values in 'team':
['Brighton' 'West Ham' 'Sheffield Utd' 'Everton' 'Fulham' 'Wolves' 'Leeds'
 'Leicester' 'Liverpool' 'West Brom' 'Arsenal' 'Southampton' 'Newcastle'
 'Chelsea' 'Crystal Palace' 'Spurs' 'Man Utd' 'Man City' 'Aston Villa'
 'Burnley' 'Watford' 'Norwich' 'Brentford' 'Bournemouth'
 'Nottingham Forest' 'Luton' 'Ipswich']

Encoding the Categorical Features

Next, we encode the categorical features. We will one-hot encode the 'position' column, since it has only 4 unique values. The 'team' column, however, will be label-encoded into a new column, 'team_label': with 27 unique teams in the database, one-hot encoding 'team' would add 27 columns and increase dimensionality substantially. Label encoding does impose an arbitrary (alphabetical) order on the teams, but our ML model will be a Random Forest, which splits on thresholds rather than fitting a coefficient to the encoded value, so this ordering is much less of a concern than it would be for a linear model.

In [79]:
# One-hot encode the 'position' column while retaining the original column
position_dummies = pd.get_dummies(master['position'], prefix='position')
master = pd.concat([master, position_dummies], axis=1)

# Label encode the 'team' column while retaining the original column
le = LabelEncoder()
master['team_label'] = le.fit_transform(master['team'])

master.head()
Out[79]:
                  name  position           team   xP  assists  bonus  bps  clean_sheets  creativity  element  ...  yellow_cards  GW  expected_goals  expected_assists  expected_goal_involvements  position_DEF  position_FWD  position_GK  position_MID  team_label
0       Aaron Connolly       FWD       Brighton  0.5        0      0   -3             0         0.3       78  ...             0   1        0.392763          0.000000                    0.392763         False          True        False         False           4
1      Aaron Cresswell       DEF       West Ham  2.1        0      0   11             0        11.2      435  ...             0   1        0.000000          0.000000                    0.000000          True         False        False         False          25
2           Aaron Mooy       MID       Brighton  0.0        0      0    0             0         0.0       60  ...             0   1             NaN               NaN                         NaN         False         False        False          True           4
3       Aaron Ramsdale        GK  Sheffield Utd  2.5        0      0   12             0         0.0      483  ...             0   1        0.000000          0.000000                    0.000000         False         False         True         False          20
4  Abdoulaye DoucourA©       MID        Everton  1.3        0      0   20             1        44.6      512  ...             0   1        0.000000          0.205708                    0.205708         False         False        False          True           8

5 rows × 44 columns
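
Since 'team_label' is an opaque integer, it helps to be able to map it back to a team name. scikit-learn's LabelEncoder numbers its classes in sorted order, so the mapping can be reproduced with plain Python, as in the sketch below. The `teams` list is copied from the unique values printed above (after the Nottingham Forest rename); the dictionary names are illustrative. On the fitted encoder from the notebook, `le.classes_` and `le.inverse_transform` give the same information directly.

```python
# Team names as printed in the notebook output, after renaming "Nott'm Forest".
teams = [
    'Brighton', 'West Ham', 'Sheffield Utd', 'Everton', 'Fulham', 'Wolves', 'Leeds',
    'Leicester', 'Liverpool', 'West Brom', 'Arsenal', 'Southampton', 'Newcastle',
    'Chelsea', 'Crystal Palace', 'Spurs', 'Man Utd', 'Man City', 'Aston Villa',
    'Burnley', 'Watford', 'Norwich', 'Brentford', 'Bournemouth',
    'Nottingham Forest', 'Luton', 'Ipswich',
]

# LabelEncoder assigns codes in sorted order of the class values,
# so enumerating the sorted names reproduces its integer labels.
team_to_label = {team: code for code, team in enumerate(sorted(teams))}
label_to_team = {code: team for team, code in team_to_label.items()}

print(team_to_label['Brighton'], team_to_label['West Ham'])
print(label_to_team[20])
```

The codes agree with the 'team_label' values shown in the output above (for example, Brighton is 4 and West Ham is 25).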

Converting to DateTime, Extracting Time Elements, and adding a Season Identifier Column

Now let us convert the 'kickoff_time' column to datetime format, extract the 'Hour', 'DayOfWeek', 'Weekend', 'WeekOfYear', 'Month', and 'Year' elements, and store them in new columns in the 'master' dataframe. We will also create a 'Season' column that assigns each row to the correct Premier League season, from 2020-2021 to 2024-2025, based on 'kickoff_time'.

In [80]:
master['kickoff_time'] = pd.to_datetime(master['kickoff_time']).dt.tz_localize(None)

master['Hour'] = master['kickoff_time'].dt.hour # Extract hour
master['DayOfWeek'] = master['kickoff_time'].dt.dayofweek # Extract day of week (Monday = 0 to Sunday = 6)
master['Weekend'] = master['DayOfWeek'].apply(lambda x: 1 if x >= 5 else 0) # Determine if weekend (1 if yes, 0 if no)
master['WeekOfYear'] = master['kickoff_time'].dt.isocalendar().week # Extract week of year
master['Month'] = master['kickoff_time'].dt.month # Extract month
master['Year'] = master['kickoff_time'].dt.year # Extract year

# Define the function to assign seasons
def assign_season(kickoff_time):
    if pd.Timestamp('2020-08-01') <= kickoff_time <= pd.Timestamp('2021-05-31'):
        return '2020-2021'
    elif pd.Timestamp('2021-08-01') <= kickoff_time <= pd.Timestamp('2022-05-31'):
        return '2021-2022'
    elif pd.Timestamp('2022-08-01') <= kickoff_time <= pd.Timestamp('2023-05-31'):
        return '2022-2023'
    elif pd.Timestamp('2023-08-01') <= kickoff_time <= pd.Timestamp('2024-05-31'):
        return '2023-2024'
    elif pd.Timestamp('2024-08-01') <= kickoff_time <= pd.Timestamp('2025-05-31'):
        return '2024-2025'
    else:
        return None  # If the date doesn't fall into any range

# Apply the function to create the 'season' column
master['Season'] = master['kickoff_time'].apply(assign_season)

master.head()
Out[80]:
                  name  position           team   xP  assists  bonus  bps  clean_sheets  creativity  element  ...  position_GK  position_MID  team_label  Hour  DayOfWeek  Weekend  WeekOfYear  Month  Year     Season
0       Aaron Connolly       FWD       Brighton  0.5        0      0   -3             0         0.3       78  ...        False         False           4    19          0        0          38      9  2020  2020-2021
1      Aaron Cresswell       DEF       West Ham  2.1        0      0   11             0        11.2      435  ...        False         False          25    19          5        1          37      9  2020  2020-2021
2           Aaron Mooy       MID       Brighton  0.0        0      0    0             0         0.0       60  ...        False          True           4    19          0        0          38      9  2020  2020-2021
3       Aaron Ramsdale        GK  Sheffield Utd  2.5        0      0   12             0         0.0      483  ...         True         False          20    17          0        0          38      9  2020  2020-2021
4  Abdoulaye DoucourA©       MID        Everton  1.3        0      0   20             1        44.6      512  ...        False          True           8    15          6        1          37      9  2020  2020-2021

5 rows × 51 columns
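
The explicit date ranges above need a new branch for every additional season. As an alternative sketch (not the notebook's method), the season label can be derived from the kickoff year and month. The function name `season_from_kickoff` and the `season_start_month` parameter are hypothetical, and the approach assumes every season starts in August or later and ends before the following August. Unlike `assign_season`, it never returns None: a June or July date is assigned to the season that just ended.

```python
from datetime import datetime

def season_from_kickoff(kickoff, season_start_month=8):
    """Return a season label such as '2020-2021' for a kickoff datetime.

    Assumes a season starts in `season_start_month` or later and finishes
    before that month in the following calendar year.
    """
    if kickoff.month >= season_start_month:
        start_year = kickoff.year
    else:
        start_year = kickoff.year - 1
    return f"{start_year}-{start_year + 1}"

print(season_from_kickoff(datetime(2020, 9, 14, 19, 0)))  # autumn kickoff
print(season_from_kickoff(datetime(2021, 5, 23, 15, 0)))  # spring kickoff, same season
```

On a pandas column the same function could be applied with `master['kickoff_time'].apply(season_from_kickoff)`, since pandas Timestamps expose `.year` and `.month`.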

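Decoding Player Names into Conventional Alphabetical Format

Next, we clean the 'name' column. Many names contain mis-encoded character sequences in place of accented letters, so we replace the recurring sequences with the intended characters and then standardize a handful of individual names, including long-form names that are written differently across seasons.

In [81]: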
def remove_accents(df):
    ''' Replace recurring symbols with their alphanumeric counterparts '''
    df['name'] = df['name'].str.replace('A©', 'é', regex=False)
    df['name'] = df['name'].str.replace('A§', 'ç', regex=False)
    df['name'] = df['name'].str.replace('A\xad', 'í', regex=False)
    df['name'] = df['name'].str.replace('A3', 'ó', regex=False)
    df['name'] = df['name'].str.replace('A¶', 'ö', regex=False)
    df['name'] = df['name'].str.replace('A1⁄4', 'ü', regex=False)
    df['name'] = df['name'].str.replace('A¤', 'ä', regex=False)
    df['name'] = df['name'].str.replace('A«', 'ë', regex=False)
    df['name'] = df['name'].str.replace('A£', 'ã', regex=False)
    df['name'] = df['name'].str.replace('A\x87', 'ć', regex=False)
    df['name'] = df['name'].str.replace('A\x98', 'O', regex=False)
    df['name'] = df['name'].str.replace('A\x82', 'l', regex=False)


    # Deal with specific outlier names
    df['name'] = df['name'].str.replace('FernA¡ndez', 'Fernández', regex=False)
    df['name'] = df['name'].str.replace('Marek RodA¡k', 'Marek Rodák', regex=False)
    df['name'] = df['name'].str.replace('GroA', 'Groß', regex=False)
    df['name'] = df['name'].str.replace('Davinson SA¡nchez', 'Davinson Sánchez', regex=False)
    df['name'] = df['name'].str.replace('Cengiz Aœnder', 'Cengiz Under', regex=False)
    df['name'] = df['name'].str.replace('FabiA¡n Balbuena', 'Fabián Balbuena', regex=False)
    df['name'] = df['name'].str.replace('Robert SA¡nchez', 'Robert Sánchez', regex=False)
    df['name'] = df['name'].str.replace('SaAol A¡iguez', 'Saúl Níguez', regex=False)
    df['name'] = df['name'].str.replace('Alvaro', 'Alvaro', regex=False)
    df['name'] = df['name'].str.replace('Son Heung-min', 'Heung-Min Son', regex=False)
    df['name'] = df['name'].str.replace('AdriA¡n San Miguel del Castillo', 'Adrián San Miguel del Castillo', regex=False)
    df['name'] = df['name'].str.replace('A\x96zil', 'Ozil', regex=False)
    df['name'] = df['name'].str.replace('A\x87aglar', 'Caglar', regex=False)
    df['name'] = df['name'].str.replace('AdriA¡n Bernabé', 'Adrián Bernabé', regex=False)
    df['name'] = df['name'].str.replace('NicolA¡s Otamendi', 'Nicolás Otamendi', regex=False)
    df['name'] = df['name'].str.replace('Thiago AlcA¡ntara', 'Thiago Alcántara', regex=False)
    df['name'] = df['name'].str.replace('SaA d Benrahma', 'Said Benrahma', regex=False)
    df['name'] = df['name'].str.replace('ImrA¢n', 'Imran', regex=False)
    df['name'] = df['name'].str.replace('DerviA\x9foA\x9flu', 'Dervişoğlu', regex=False)
    df['name'] = df['name'].str.replace('Francisco Jorge TomA¡s Oliveira', 'Francisco Jorge Tomás Oliveira', regex=False)
    df['name'] = df['name'].str.replace('Benjamin Chilwell', 'Ben Chilwell', regex=False)
    df['name'] = df['name'].str.replace('Emiliano Martínez Romero', 'Emiliano Martínez', regex=False)
    df['name'] = df['name'].str.replace('Gabriel dos Santos Magalhães', 'Gabriel Magalhães', regex=False)
    df['name'] = df['name'].str.replace('Gabriel Teodoro Martinelli Silva', 'Gabriel Martinelli', regex=False)
    df['name'] = df['name'].str.replace('Gabriel Martinelli Silva', 'Gabriel Martinelli', regex=False)
    df['name'] = df['name'].str.replace('Joelinton CA¡ssio ApolinA¡rio de Lira', 'Joelinton', regex=False)
    df['name'] = df['name'].str.replace('Matteo Guendouzi', 'Mattéo Guendouzi', regex=False)
    df['name'] = df['name'].str.replace('Romain SaA ss', 'Romain Saïss', regex=False)
    df['name'] = df['name'].str.replace('Pablo HernA¡ndez Domínguez', 'Pablo Hernández Domínguez', regex=False)
    df['name'] = df['name'].str.replace('RAoben Diogo da Silva Neves', 'Rúben da Silva Neves', regex=False)
    df['name'] = df['name'].str.replace('Paulo Gazzaniga Farias', 'Paulo Gazzaniga', regex=False)
    df['name'] = df['name'].str.replace('Tanguy Ndombélé Alvaro', 'Tanguy Ndombele', regex=False)
    df['name'] = df['name'].str.replace('Bruno Borges Fernandes', 'Bruno Fernandes', regex=False)
    df['name'] = df['name'].str.replace('Bruno Miguel Borges Fernandes', 'Bruno Fernandes', regex=False)

    return df

master = remove_accents(master)
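
Because the same player can appear with and without accents in different seasons (for example 'Matteo' versus 'Mattéo'), an accent-insensitive key can be useful when checking whether two spellings refer to the same name. The sketch below uses only the standard-library `unicodedata` module; `strip_accents` and `name_key` are hypothetical helpers, not functions from the notebook, and the key is meant for comparison only, not as a replacement for the cleaned 'name' column.

```python
import unicodedata

def strip_accents(text):
    """Return `text` with combining accent marks removed (e.g. 'é' becomes 'e')."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the characters that are not combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_key(text):
    """Build a case- and accent-insensitive key for comparing two spellings."""
    return strip_accents(text).casefold()

print(strip_accents('Fernández'))
print(name_key('Mattéo Guendouzi') == name_key('Matteo Guendouzi'))
```

Letters without a Unicode decomposition (such as 'ł' or 'ß') are left unchanged by `strip_accents`, so names containing them would still need explicit handling.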

Duplicates

Let's also examine the data for the presence of duplicate rows.

In [82]:
print(f"Number of duplicate rows in the master data set: {master.duplicated().sum()}")
Number of duplicate rows in the master data set: 0

Nice, no duplicates! Also, this makes sense, since each row in our data represents a unique player in a unique gameweek in a unique fixture, so duplicates would have indicated errors in data sourcing.
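
`master.duplicated()` only flags rows that match in every column. A stricter check looks for repeats of the identifying key alone. The sketch below is a toy illustration: the frame is hand-made, and treating ('Season', 'element', 'fixture') as the unique key is an assumption based on the column descriptions, not something verified against the full data.

```python
import pandas as pd

# Toy rows: the last two share the same key but differ in total_points.
toy = pd.DataFrame({
    'Season': ['2020-2021', '2020-2021', '2021-2022', '2021-2022'],
    'element': [78, 78, 78, 78],
    'fixture': [1, 12, 1, 1],
    'total_points': [1, 8, 2, 3],
})

# Exact duplicates: every column must match.
full_dupes = int(toy.duplicated().sum())

# Key duplicates: only the assumed identifying columns must match.
key_cols = ['Season', 'element', 'fixture']
key_dupes = int(toy.duplicated(subset=key_cols).sum())

print(f"Exact duplicate rows: {full_dupes}")
print(f"Rows repeating an existing key: {key_dupes}")
```

Here the exact check finds nothing while the key check flags one row, which is the kind of conflicting record the full-row check would miss.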

Missing Values / NaNs

First, let's explore how many missing values exist in the master dataframe:

In [83]:
# Determine missing values for each column
missing_values = master.isnull().sum()

# Create a df of missing values
missing_df = pd.DataFrame({'Missing Values': missing_values})

# Show all rows
pd.set_option('display.max_rows', None)  # Show all rows
missing_df
Out[83]:
Missing Values
name 0
position 0
team 0
xP 0
assists 0
bonus 0
bps 0
clean_sheets 0
creativity 0
element 0
fixture 0
goals_conceded 0
goals_scored 0
Influence_Creativity_Threat_Index 0
influence 0
kickoff_time 0
minutes 0
opponent_team 0
own_goals 0
penalties_missed 0
penalties_saved 0
red_cards 0
round 0
saves 0
selected 0
team_a_score 0
team_h_score 0
threat 0
total_points 0
transfers_balance 0
transfers_in 0
transfers_out 0
value 0
was_home 0
yellow_cards 0
GW 0
expected_goals 32648
expected_assists 32648
expected_goal_involvements 32648
position_DEF 0
position_FWD 0
position_GK 0
position_MID 0
team_label 0
Hour 0
DayOfWeek 0
Weekend 0
WeekOfYear 0
Month 0
Year 0
Season 0

So, we see that our efforts to impute/drop missing values will have to focus on three main features: 'expected_goals', 'expected_assists', and 'expected_goal_involvements'. We will go ahead and impute these missing values with the mean of that player's xG/xA/xGI for that specific season, to provide a temporally appropriate context for the substitution. If no values exist in that season, we will impute with the mean of that player's xG/xA/xGI across all seasons.

In [84]:
# Function to impute missing values for grouped data
def impute_mean_per_group(df, group_cols):
    # Identify columns with missing values
    missing_columns = df.columns[df.isnull().any()]
    
    for col in missing_columns:
        # Step 1: Impute missing values using the mean for each group (name, season)
        df[col] = df.groupby(group_cols)[col].transform(
            lambda group: group.fillna(group.mean())
        )
        
        # Step 2: Handle edge cases where the group mean couldn't be calculated
        # Fall back to mean for the player across all seasons
        df[col] = df.groupby('name')[col].transform(
            lambda group: group.fillna(group.mean())
        )
    
    return df

# Apply the imputation
master = impute_mean_per_group(master, ['name', 'Season'])

# Determine missing values for each column
missing_values = master.isnull().sum()

# Create a df of missing values
missing_df = pd.DataFrame({'Missing Values': missing_values})

# Show all rows
pd.set_option('display.max_rows', None)  # Show all rows
missing_df
Out[84]:
Missing Values
name 0
position 0
team 0
xP 0
assists 0
bonus 0
bps 0
clean_sheets 0
creativity 0
element 0
fixture 0
goals_conceded 0
goals_scored 0
Influence_Creativity_Threat_Index 0
influence 0
kickoff_time 0
minutes 0
opponent_team 0
own_goals 0
penalties_missed 0
penalties_saved 0
red_cards 0
round 0
saves 0
selected 0
team_a_score 0
team_h_score 0
threat 0
total_points 0
transfers_balance 0
transfers_in 0
transfers_out 0
value 0
was_home 0
yellow_cards 0
GW 0
expected_goals 9772
expected_assists 9772
expected_goal_involvements 9772
position_DEF 0
position_FWD 0
position_GK 0
position_MID 0
team_label 0
Hour 0
DayOfWeek 0
Weekend 0
WeekOfYear 0
Month 0
Year 0
Season 0

As we can see, almost 10,000 values (9,772) are still missing in each of 'expected_goals', 'expected_assists', and 'expected_goal_involvements'. They survive both imputation passes because the affected players have no recorded value for these metrics in any season, so there is no player-level mean to fall back on. Therefore, we will go ahead and drop the rows with the remaining missing values.

In [85]:
# Drop every row that still contains a missing value in any column
master = master.dropna()
master.shape
Out[85]:
(102148, 51)
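
To see why some values cannot be filled by either pass, consider the toy frame below (hand-made values, not taken from 'master.csv'). Player 'A' has data in one season, so the player-level mean rescues the other season; player 'B' has no data at all, so both group means are NaN and the gaps remain. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Season': ['2021-2022', '2021-2022', '2022-2023', '2022-2023',
               '2021-2022', '2022-2023'],
    'expected_goals': [np.nan, np.nan, 0.2, 0.4, np.nan, np.nan],
})

# Pass 1: fill with the mean of the same player in the same season.
season_mean = toy.groupby(['name', 'Season'])['expected_goals'].transform('mean')
after_pass1 = toy['expected_goals'].fillna(season_mean)

# Pass 2: fill what is left with the player's mean across all seasons.
player_mean = after_pass1.groupby(toy['name']).transform('mean')
after_pass2 = after_pass1.fillna(player_mean)

print(after_pass2.tolist())
print(f"Still missing: {int(after_pass2.isna().sum())}")
```

Player 'A' ends up with the cross-season mean in the empty season, while both rows for player 'B' stay missing and would be removed by `dropna()`.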

Filtering Out Rows with Zero or Limited Minutes Played

Now, we will filter out rows where the number of minutes played, 'minutes', is zero. The original 'master' dataframe includes all players in a season, including those who are listed in a team's squad but did not play (for example, unused substitutes).

We should also filter out rows with very few 'minutes', because they tend to have incomplete or missing data. They may also bias our model and analysis by capturing late-game tactics (e.g., a manager bringing on a player near the end of a game that his team is leading, to strengthen the defense and protect the lead). However, the threshold should not be set so high that it favors starters, so we need to strike a balance when choosing it.

We will proceed by filtering out rows with fewer than 5 minutes of gameplay.

In [86]:
# Print original length
print('Original Length of master dataframe: ', master.shape[0])

# Print the number of rows with minutes = 0
print(f"Number of rows with minutes = 0: {master[master['minutes'] == 0].shape[0]}")
# Print the number of rows with minutes between 0 and 5
print(f"Number of rows with minutes between 0 and 5: {master[(master['minutes'] > 0) & (master['minutes'] < 5)].shape[0]}")
Original Length of master dataframe:  102148
Number of rows with minutes = 0: 57408
Number of rows with minutes between 0 and 5: 1773

As we can see, a large portion of the database (57,408 rows, or 56.2%) represents players who did not play in a given fixture. By filtering these out, we can focus on players with concrete contributions when creating visualizations and designing our model. Rows with between 1 and 4 minutes of gameplay will also be dropped; these account for a much smaller number of entries (1,773 rows, or 1.7%).

In [87]:
master = master[master['minutes'] >= 5]

print('Length of master dataframe after filtering out players with 0-5 minutes of gameplay:', master.shape[0])
Length of master dataframe after filtering out players with 0-5 minutes of gameplay: 42967
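
The row counts reported above can be cross-checked with simple arithmetic. The numbers below are copied from the printed outputs; the variable names are illustrative.

```python
# Counts copied from the notebook outputs above.
rows_before = 102148
rows_zero_minutes = 57408
rows_one_to_four_minutes = 1773

# Rows that survive the `minutes >= 5` filter.
rows_after = rows_before - rows_zero_minutes - rows_one_to_four_minutes

# Share of rows removed by each condition.
share_zero = rows_zero_minutes / rows_before
share_short = rows_one_to_four_minutes / rows_before

print(rows_after)
print(f"{share_zero:.1%} of rows had zero minutes")
print(f"{share_short:.1%} of rows had 1 to 4 minutes")
```

The remainder matches the 42,967 rows reported after filtering.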

Dealing with Invalid Rows

Now, let us examine invalid values of the expected-goals metric. If a player's 'goals_scored' is greater than zero, that row's 'expected_goals' cannot plausibly be zero. We therefore look for rows that violate this condition and replace their 'expected_goals' with the mean of that player's non-zero 'expected_goals' values in that season, falling back to the player's non-zero mean across all seasons and, failing that, to the overall mean of the column.

In [88]:
# Filter rows where goals_scored > 0 and expected_goals == 0
invalid_rows = master[(master['goals_scored'] > 0) & (master['expected_goals'] == 0)]

# Count how many times this happens
invalid_count = len(invalid_rows)

print(f"Number of rows where goals_scored > 0 but expected_goals = 0: {invalid_count}")
Number of rows where goals_scored > 0 but expected_goals = 0: 373
In [89]:
# Impute expected_goals with the mean for that player and season,
# with fallback to player-level or global mean
def impute_expected_goals(row):
    if row['goals_scored'] > 0 and row['expected_goals'] == 0:
        # Calculate the mean of expected_goals for the player and season
        season_mean = master[
            (master['name'] == row['name']) & 
            (master['Season'] == row['Season']) & 
            (master['expected_goals'] > 0)
        ]['expected_goals'].mean()
        
        # Fallback to the mean for the player across all seasons
        if pd.isna(season_mean):
            player_mean = master[
                (master['name'] == row['name']) & 
                (master['expected_goals'] > 0)
            ]['expected_goals'].mean()
            return player_mean if pd.notna(player_mean) else master['expected_goals'].mean()
        
        return season_mean
    else:
        return row['expected_goals']  # Leave unchanged

# Apply the imputation
master['expected_goals'] = master.apply(impute_expected_goals, axis=1)

# Verify the result
rows_with_condition = master[(master['goals_scored'] > 0) & (master['expected_goals'] == 0)]
print(f"Number of rows where goals_scored > 0 but expected_goals = 0: {len(rows_with_condition)}")
Number of rows where goals_scored > 0 but expected_goals = 0: 0
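
The row-wise `apply` above re-filters the whole dataframe for every row, which can be slow on larger data. The sketch below shows one possible vectorized formulation of the same three-level fallback using `groupby(...).transform`. It runs on a hand-made toy frame, and the intermediate names (`positive_xg`, `fill`, `expected_goals_fixed`) are illustrative; it is a sketch under these assumptions rather than a drop-in replacement.

```python
import pandas as pd

toy = pd.DataFrame({
    'name':   ['A', 'A', 'A', 'B', 'B'],
    'Season': ['2022-2023'] * 5,
    'goals_scored':   [1, 0, 1, 1, 0],
    'expected_goals': [0.0, 0.2, 0.6, 0.0, 0.0],
})

# Keep only strictly positive xG values; zeros become NaN and are ignored by the means.
positive_xg = toy['expected_goals'].where(toy['expected_goals'] > 0)

# Fallback chain: player-season mean, then player mean, then overall column mean.
season_mean = positive_xg.groupby([toy['name'], toy['Season']]).transform('mean')
player_mean = positive_xg.groupby(toy['name']).transform('mean')
global_mean = toy['expected_goals'].mean()
fill = season_mean.fillna(player_mean).fillna(global_mean)

# Replace only the rows where a goal was scored but xG is zero.
invalid = (toy['goals_scored'] > 0) & (toy['expected_goals'] == 0)
toy['expected_goals_fixed'] = toy['expected_goals'].where(~invalid, fill)

print(toy[['name', 'goals_scored', 'expected_goals', 'expected_goals_fixed']])
```

Player 'A' receives the mean of their non-zero values, while player 'B', who has no non-zero value, falls through to the overall mean.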

Negative Values

Now, we need to check the numeric columns for negative values.

In [90]:
for column in master.columns:
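    # Check if the column is numeric (any integer or float width, excluding booleans)
    if pd.api.types.is_numeric_dtype(master[column]) and not pd.api.types.is_bool_dtype(master[column]):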
        # Filter rows with negative values
        negatives = master[master[column] < 0]
        if not negatives.empty:
            print(f"Negative values found in column '{column}':")
            print(len(negatives))
            print("\n")
Negative values found in column 'xP':
1621


Negative values found in column 'bps':
1490


Negative values found in column 'total_points':
502


Negative values found in column 'transfers_balance':
21739


'total_points', 'bps', and 'transfers_balance' can legitimately be negative. Players are penalized for events like own goals, red cards, and goals conceded, so those negatives can and should be retained. In addition, 'transfers_balance' is a net figure (transfers in minus transfers out), so negative values are expected there as well.

However, 'xP' values should generally be non-negative because they are probabilities multiplied by point weights, so negative values may indicate errors in data sourcing. Each negative 'xP' will thus be replaced with the average of that player's nearest valid xP values from the gameweeks before and after it. If a valid value exists on only one side, that value is used on its own. Because the 'master' dataframe is a concatenation of several seasons, the aim is to avoid jumping between seasons; note, however, that the function below groups by player name only, so neighboring gameweeks are matched on 'GW' across all of that player's seasons.

In [91]:
# Replace negative xP values with NaN
master.loc[master['xP'] < 0, 'xP'] = np.nan

# Function to impute xP
def impute_xp(df):
    # Iterate through each player's data
    for name, group in df.groupby('name'):
        # Loop through rows with NaN in xP
        for idx in group[group['xP'].isna()].index:
            current_gw = df.loc[idx, 'GW']

            # Find valid rows with a lower / higher GW for this player
            # (the group is keyed on name only, so it spans all of the player's seasons)
            previous_idx = group[
                (group['GW'] < current_gw) & (~group['xP'].isna())
            ].index.max()
            next_idx = group[
                (group['GW'] > current_gw) & (~group['xP'].isna())
            ].index.min()

            if pd.notna(previous_idx) and pd.notna(next_idx):
                # Average of the previous and next valid xP values
                df.loc[idx, 'xP'] = (df.loc[previous_idx, 'xP'] + df.loc[next_idx, 'xP']) / 2
            elif pd.notna(previous_idx):
                # Use the previous valid xP value
                df.loc[idx, 'xP'] = df.loc[previous_idx, 'xP']
            elif pd.notna(next_idx):
                # Use the next valid xP value
                df.loc[idx, 'xP'] = df.loc[next_idx, 'xP']
            else:
                # Fallback: Use the closest available xP value
                neighbor_idx = group[~group['xP'].isna()].index.min()
                if pd.notna(neighbor_idx):
                    df.loc[idx, 'xP'] = df.loc[neighbor_idx, 'xP']
    return df

# Apply the imputation function
master = impute_xp(master)

# Verify the result
print(len(master[master['xP'].isna()]))  # Number of xP values still missing (0 if every NaN was imputed)
29

The remaining NaNs most likely belong to players with no valid xP value to impute from under the conditions above. Therefore, let's go ahead and drop these 29 rows.

In [92]:
print(master.shape)
master = master.dropna()
print(master.shape)
(42967, 51)
(42938, 51)
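
If the imputation is meant to stay strictly within a season, one option is to group by both player and season and take the nearest valid neighbors with a forward and backward fill. The sketch below is an alternative formulation on a hand-made toy frame, not the function used above, and the names `xp_by_season`, `neighbors`, and `xP_imputed` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name':   ['A'] * 6,
    'Season': ['2022-2023'] * 4 + ['2023-2024'] * 2,
    'GW':     [1, 2, 3, 4, 1, 2],
    'xP':     [2.0, np.nan, np.nan, 4.0, np.nan, 6.0],
})

# Order rows chronologically within each player and season.
toy = toy.sort_values(['name', 'Season', 'GW'])
xp_by_season = toy.groupby(['name', 'Season'])['xP']

# Nearest valid value before and after each row, never crossing a season boundary.
neighbors = pd.DataFrame({
    'previous': xp_by_season.ffill(),
    'next': xp_by_season.bfill(),
})

# Row-wise mean skips NaN: the average of both neighbors when both exist,
# otherwise whichever single neighbor is available.
toy['xP_imputed'] = toy['xP'].fillna(neighbors.mean(axis=1))

print(toy)
```

In the toy data, the gap in gameweek 1 of 2023-2024 is filled from gameweek 2 of the same season rather than from the previous season. A player-season with no valid value at all would still be left as NaN.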

Adding Cumulative and Combination Features

Let's also add some per-minute and cumulative metric columns to our database: goals per minute ('gpm'), assists per minute ('apm'), and, accumulated per player within each season, 'cumulative_goals', 'cumulative_assists', 'cumulative_xG', 'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm', 'cumulative_xP', 'cumulative_points', and 'cumulative_minutes'.

In [93]:
master['gpm'] = master['goals_scored']/master['minutes'] # Create a column for goals per minute
master['apm'] = master['assists']/master['minutes'] # Create a column for assists per minute

# Ensure the 'master' DataFrame is sorted by 'Season', 'name', and 'kickoff_time'
master = master.sort_values(by=['Season', 'name', 'kickoff_time'])

# Define a function to calculate cumulative metrics for each season
def calculate_cumulative_metrics(group):
    group['cumulative_goals'] = group['goals_scored'].cumsum()
    group['cumulative_assists'] = group['assists'].cumsum()
    group['cumulative_xG'] = group['expected_goals'].cumsum()
    group['cumulative_xA'] = group['expected_assists'].cumsum()
    group['cumulative_xGI'] = group['expected_goal_involvements'].cumsum()
    group['cumulative_gpm'] = group['cumulative_goals'] / group['minutes'].cumsum()
    group['cumulative_apm'] = group['cumulative_assists'] / group['minutes'].cumsum()
    group['cumulative_xP'] = group['xP'].cumsum()
    group['cumulative_points'] = group['total_points'].cumsum()
    group['cumulative_minutes'] = group['minutes'].cumsum()
    return group

# Group by both 'Season' and 'name', then apply the function
master = master.groupby(['Season', 'name']).apply(calculate_cumulative_metrics)

# Reset the index
master.reset_index(drop=True, inplace=True)

# Display the head
master.head()
Out[93]:
name position team xP assists bonus bps clean_sheets creativity element ... cumulative_goals cumulative_assists cumulative_xG cumulative_xA cumulative_xGI cumulative_gpm cumulative_apm cumulative_xP cumulative_points cumulative_minutes
0 Aaron Connolly FWD Brighton 0.5 0 0 -3 0 0.3 78 ... 0 0 0.392763 0.000000 0.392763 0.000000 0.000000 0.5 1 45
1 Aaron Connolly FWD Brighton 4.0 0 2 27 1 11.3 78 ... 1 0 0.554273 0.016604 0.570877 0.007463 0.000000 4.5 9 134
2 Aaron Connolly FWD Brighton 2.7 0 0 2 0 12.1 78 ... 1 0 0.586928 0.057287 0.644215 0.004831 0.000000 7.2 11 207
3 Aaron Connolly FWD Brighton 2.7 0 0 7 0 0.3 78 ... 1 0 0.586928 0.057287 0.644215 0.003676 0.000000 9.9 13 272
4 Aaron Connolly FWD Brighton 3.0 1 0 13 0 10.3 78 ... 1 1 0.586928 0.109529 0.696457 0.003521 0.003521 12.9 17 284

5 rows × 63 columns
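The per-minute columns above divide by 'minutes', which yields `inf` or `NaN` for any row where a player has zero minutes. Whether such rows exist in `master` is not shown here, so the following is only a hedged sketch on made-up data of one way to guard the division; the `fillna(0.0)` choice (treating "no minutes" as a rate of zero) is an assumption:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "goals_scored": [0, 1, 0],
    "minutes": [0, 90, 45],
})

# Replace zero minutes with NaN so that x/0 gives NaN instead of inf
minutes = toy["minutes"].replace(0, np.nan)
toy["gpm"] = (toy["goals_scored"] / minutes).fillna(0.0)

# Cumulative rate: running goals over running minutes, guarded the same way
cum_minutes = toy["minutes"].cumsum().replace(0, np.nan)
toy["cumulative_gpm"] = (toy["goals_scored"].cumsum() / cum_minutes).fillna(0.0)
```

This matches the notebook's definition of 'cumulative_gpm' (for example, 1 goal over 134 cumulative minutes gives 0.007463 in the table above).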

Converting Player 'value' Unit to Million GBP (£)

The unit for player 'value' is 100,000s of GBP (£). For example, a 'value' of 50 is equivalent to £5 million. Therefore, let's convert that column to millions of GBP by dividing by 10.

In [94]:
master['value'] = master['value']/10

Saving a New, Cleaned CSV File

Now, finally, let's go ahead and save a cleaned master file to CSV format.

In [95]:
master.to_csv('../master_cleaned.csv', index=False)

Visualizations

First, let's import the cleaned/filtered data into a new dataframe called 'master_cleaned'.

In [96]:
master_cleaned = pd.read_csv('../master_cleaned.csv')
master_cleaned.head()
Out[96]:
name position team xP assists bonus bps clean_sheets creativity element ... cumulative_goals cumulative_assists cumulative_xG cumulative_xA cumulative_xGI cumulative_gpm cumulative_apm cumulative_xP cumulative_points cumulative_minutes
0 Aaron Connolly FWD Brighton 0.5 0 0 -3 0 0.3 78 ... 0 0 0.392763 0.000000 0.392763 0.000000 0.000000 0.5 1 45
1 Aaron Connolly FWD Brighton 4.0 0 2 27 1 11.3 78 ... 1 0 0.554273 0.016604 0.570877 0.007463 0.000000 4.5 9 134
2 Aaron Connolly FWD Brighton 2.7 0 0 2 0 12.1 78 ... 1 0 0.586928 0.057287 0.644215 0.004831 0.000000 7.2 11 207
3 Aaron Connolly FWD Brighton 2.7 0 0 7 0 0.3 78 ... 1 0 0.586928 0.057287 0.644215 0.003676 0.000000 9.9 13 272
4 Aaron Connolly FWD Brighton 3.0 1 0 13 0 10.3 78 ... 1 1 0.586928 0.109529 0.696457 0.003521 0.003521 12.9 17 284

5 rows × 63 columns
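A CSV round trip does not preserve datetime dtypes, so 'kickoff_time' comes back as plain text unless it is parsed on load. A small sketch using an in-memory CSV (the timestamp format shown is an assumption about how 'kickoff_time' is stored):

```python
import io

import pandas as pd

csv_text = (
    "name,kickoff_time,total_points\n"
    "A,2020-09-12T11:30:00Z,2\n"
    "B,2020-09-12T14:00:00Z,6\n"
)

# Default load: kickoff_time is read as strings
plain = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the column to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["kickoff_time"])
```

Applying `pd.to_datetime` after loading achieves the same result.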

In [97]:
master_cleaned.columns
Out[97]:
Index(['name', 'position', 'team', 'xP', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'Influence_Creativity_Threat_Index', 'influence',
       'kickoff_time', 'minutes', 'opponent_team', 'own_goals',
       'penalties_missed', 'penalties_saved', 'red_cards', 'round', 'saves',
       'selected', 'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW', 'expected_goals', 'expected_assists',
       'expected_goal_involvements', 'position_DEF', 'position_FWD',
       'position_GK', 'position_MID', 'team_label', 'Hour', 'DayOfWeek',
       'Weekend', 'WeekOfYear', 'Month', 'Year', 'Season', 'gpm', 'apm',
       'cumulative_goals', 'cumulative_assists', 'cumulative_xG',
       'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm',
       'cumulative_xP', 'cumulative_points', 'cumulative_minutes'],
      dtype='object')

What makes Premier League performance intriguing? Why should we care about metrics like total points, home vs. away trends, and penalty impacts? By understanding these, we can better evaluate players' consistency, impact, and potential.

We start by looking at all the metrics available and how they correlate to total points.

Total points is an important metric to evaluate player performance as it aggregates key contributions such as goals, assists, clean sheets, and bonus points.

In [98]:
master_cleaned_copy = master_cleaned.copy()

numeric_data = master_cleaned_copy.select_dtypes(include=["float64", "int64"])
 
correlation_matrix = numeric_data.corr()
 
plt.figure(figsize=(15, 12))
sns.heatmap(
    correlation_matrix,
    annot=False,
    cmap="coolwarm",
    center=0,
    vmin=-1,
    vmax=1,
    square=True,
    linewidths=0.5,
)
plt.title("Heatmap of Variable Correlations (Collinearity Check)")

plt.show()
[Figure: Heatmap of Variable Correlations (Collinearity Check)]

Here, we have an overview of the pairwise correlations among all numeric Fantasy Premier League metrics, including total points, providing a big-picture view of the relationships within the data. At first glance, we can see strong correlations between total points and metrics like BPS, influence, and expected goal involvements, which align with player performance expectations. However, interpreting this heatmap alone has its limitations: with dozens of variables and substantial collinearity between them, it is hard to draw specific, actionable conclusions for team selection.
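Two things can be read off a correlation matrix more easily as tables than as a heatmap: which features correlate most with the target, and which feature pairs are nearly collinear. A minimal sketch on synthetic data (the column names are borrowed from the dataset, but the values and the 0.9 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
bps = rng.normal(size=n)
toy = pd.DataFrame({
    "bps": bps,
    "influence": bps + rng.normal(scale=0.1, size=n),  # nearly collinear with bps
    "noise": rng.normal(size=n),
})
toy["total_points"] = 2 * toy["bps"] + rng.normal(scale=0.5, size=n)

corr = toy.corr()

# Features ranked by absolute correlation with the target
ranked = corr["total_points"].drop("total_points").abs().sort_values(ascending=False)

# Feature pairs (upper triangle only) whose |r| exceeds a collinearity cutoff
features = corr.drop(index="total_points", columns="total_points")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
collinear_pairs = upper.stack()
collinear_pairs = collinear_pairs[collinear_pairs.abs() > 0.9]
```

Masking to the upper triangle avoids listing each pair twice and drops the trivial self-correlations on the diagonal.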

To uncover deeper insights, we need to break this down further by analyzing metrics specific to player positions.

Our story starts with creating categories to understand the fundamental metrics which will help establish conclusions about player performance.

Category 1: Positional Metrics

In [99]:
# Position-wise performance

sns.boxplot(x='position', y='total_points', data=master_cleaned, palette='Set3')
plt.title("Position-wise Distribution of Total Points")
plt.xlabel("Position")
plt.ylabel("Total Points")
plt.show()
[Figure: Position-wise Distribution of Total Points (box plot)]

This plot shows the spread of per-match points for players in each position, highlighting the variability within each position.

Midfielders (MID) show the widest spread of points and the highest potential for top performance (indicated by outliers). Defenders (DEF) and goalkeepers (GK) have tighter distributions, reflecting more consistent yet limited scoring opportunities. Forwards (FWD) also have high outliers, driven by exceptional individual performances.
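The quantities a box plot draws can also be tabulated directly, which makes the "spread" comparison concrete. A small sketch on made-up points values; the 1.5 x IQR fence is the default whisker rule for box plots:

```python
import pandas as pd

toy = pd.DataFrame({
    "position": ["GK", "GK", "GK", "GK", "MID", "MID", "MID", "MID"],
    "total_points": [1, 2, 3, 6, 0, 2, 5, 15],
})

# Quartiles per position, one row per position
summary = toy.groupby("position")["total_points"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]

# Interquartile range and the upper fence beyond which points count as outliers
summary["iqr"] = summary["q3"] - summary["q1"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]
```

A larger 'iqr' corresponds to a taller box, and values above 'upper_fence' are drawn as individual outlier markers.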

In [100]:
plt.figure(figsize=(10, 6))
sns.kdeplot(data=master_cleaned, x='total_points', hue='position', fill=True, alpha=0.6)

plt.title("Points Per Match Distribution\nGrouped by Player Position", fontsize=16)
plt.xlabel("Points Per Match", fontsize=12)
plt.ylabel("Density", fontsize=12)

plt.show()
[Figure: Points Per Match Distribution, Grouped by Player Position (KDE)]

This plot shows the distribution of points per match for players grouped by position. Midfielders (MID) have the highest density peak, which suggests consistent returns; note, however, that `kdeplot` defaults to `common_norm=True`, which scales each curve by its number of observations, so part of this height reflects midfielders having the most rows in the data. Defenders (DEF) and forwards (FWD) have similar distributions but with a wider spread, which suggests more variable returns. There appears to be a second distinct peak for defenders, which might correspond to full-backs (FBs), attacking defenders who have a higher chance of scoring. Goalkeepers (GK) have a distinctly narrow distribution, which highlights their specialized role: they mostly earn returns by keeping a clean sheet and have negligible avenues to score points beyond that.
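To compare the shape of each position's distribution independently of how many rows it has, the shares can be normalized within each position (the tabular analogue of passing `common_norm=False` to `kdeplot`). A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "position": ["MID"] * 6 + ["GK"] * 2,
    "total_points": [2, 2, 2, 2, 1, 6, 2, 6],
})

# Share of each points value within its own position (each row sums to 1)
shares = (
    toy.groupby("position")["total_points"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
```

Here midfielders have three times as many rows as goalkeepers, yet each row of `shares` sums to 1, so the positions can be compared on equal footing.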

In [101]:
# Calculate the total points per season for each player
season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()

# Define the order of positions for plotting
position_order = ['GK', 'FWD', 'DEF', 'MID']

plt.figure(figsize=(10, 6))
sns.kdeplot(
    data=season_points,
    x='total_points',
    hue='position',
    hue_order=position_order,  # Control the order of positions
    fill = False,
    common_norm = False,
    alpha=0.6
)

plt.title("Total Season Points Distribution\nGrouped by Player Position", fontsize=16)
plt.xlabel("Total Season Points", fontsize=12)
plt.ylabel("Density", fontsize=12)

plt.show()
[Figure: Total Season Points Distribution, Grouped by Player Position (KDE)]

This plot shows the distribution of total season points for players grouped by position. Each curve is a kernel density estimate (KDE), normalized separately per position (`common_norm=False`), indicating how season totals are distributed within that position. For example, midfielders have a higher peak and a fatter tail, suggesting a broader range of high-scoring players. Goalkeepers appear to cluster in two distinct buckets, likely because goalkeepers from weaker teams converge at the first peak while those from the small group of elite teams converge at the second. Forwards also have the longest tail, indicating the presence of exceptional performers.

In [102]:
# Function to calculate total points per season for each player, filter top n players, and count by position
def get_top_players_and_count_by_position(master_cleaned, n=20):
    # Calculate total points for the season for each player
    season_points = (master_cleaned.groupby(['Season', 'name', 'position'])['total_points'].sum().reset_index().sort_values(by=['Season', 'total_points'], ascending=[True, False]))
    # Get the top n players per season
    top_players = season_points.groupby('Season').head(n)
    
    # Count the number of players by position for each season
    position_counts = top_players.groupby(['Season', 'position'])['name'].count().reset_index()
    position_counts.rename(columns={'name': 'count'}, inplace=True)
    return top_players, position_counts


# Generate the position_counts
_, position_counts = get_top_players_and_count_by_position(master_cleaned, n=50)

# Pivot the data for a clustered bar plot
pivot_data = position_counts.pivot(index='Season', columns='position', values='count')

# Create a clustered bar plot
fig, ax = plt.subplots(figsize=(12, 6))
pivot_data.plot(kind='bar', ax=ax, width=0.8)

# Add labels and title
ax.set_xlabel('Season', fontsize=12)
ax.set_ylabel('Count of Top Players', fontsize=12)
ax.set_title('Top Players by Position Across Seasons', fontsize=14)
ax.legend(title='Position', fontsize=10)

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figure: Top Players by Position Across Seasons (clustered bar plot)]

This plot shows how many of the top 50 season scorers come from each position, by season. Midfielders consistently dominate the top-player count across all seasons, followed by defenders and, in recent years, forwards, while goalkeepers have recently had fewer top-performing players. The trend highlights positional disparities in top-player representation over time.

In [103]:
avg_points_per_match = master_cleaned.groupby(['Season', 'position'])['total_points'].mean().reset_index()

plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_per_match, x='position', y='total_points', hue='Season')

plt.title("Average Points Per Match by Position Across Seasons", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Points Per Match", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
[Figure: Average Points Per Match by Position Across Seasons (bar plot)]

This bar plot shows the average points per match for each position across seasons. Goalkeepers appear to be the most consistent in point returns, suggesting steady performance throughout the season, whereas other positions likely have higher weekly variance even though they score more cumulatively over the season. Defenders in particular appear to have struggled in recent years.

Similarly, midfielders appear to be not only consistent but also to have higher points ceilings, as they have the highest density at high season totals (the fat tail in the earlier KDE figure, "Total Season Points Distribution Grouped by Player Position").
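Averages alone cannot show consistency, which is a statement about spread. One way to check such claims is to tabulate the standard deviation alongside the mean; the sketch below does this on made-up values, and the coefficient of variation is an illustrative choice of metric rather than something the notebook computes:

```python
import pandas as pd

toy = pd.DataFrame({
    "Season": ["2023-2024"] * 6,
    "position": ["GK", "GK", "GK", "MID", "MID", "MID"],
    "total_points": [2, 3, 4, 0, 2, 10],
})

# Mean and sample standard deviation of per-match points per season and position
stats = (
    toy.groupby(["Season", "position"])["total_points"]
    .agg(mean="mean", std="std")
    .reset_index()
)

# Coefficient of variation: spread relative to the mean (lower = steadier returns)
stats["cv"] = stats["std"] / stats["mean"]
```

In this toy example the goalkeeper has the lower mean but also the much lower relative spread, which is the pattern the text describes.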

In [104]:
total_season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()

N = 50  # N here represents top N players by position
top_n_players_season = (total_season_points.groupby(['Season', 'position']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))

# Calculate average points per match (PPM) for the top N players by merging back onto the original dataframe
merged_top_n = master_cleaned.merge(top_n_players_season[['name', 'Season', 'position']], on=['name', 'Season', 'position'], how='inner')

avg_points_top_n = merged_top_n.groupby(['Season', 'position'])['total_points'].mean().reset_index() # average PPM for the top N players

plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_top_n, x='position', y='total_points', hue='Season')

plt.title(f"Average Points Per Match (Top {N} Players by Position Each Season)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Points Per Match", fontsize=12)
plt.legend(title="Season", loc='upper left')

plt.show()
[Figure: Average Points Per Match (Top 50 Players by Position Each Season)]

This plot is similar to the one above but restricted to the top 50 players per position in each season. The trend suggests that midfielders have the highest point-earning potential.
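The top-N selection per season and position can also be written without `groupby().apply()`, by sorting once and taking the head of each group. A minimal sketch on made-up totals (names and values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "Season": ["2023-2024"] * 5,
    "position": ["MID", "MID", "MID", "FWD", "FWD"],
    "total_points": [200, 150, 180, 90, 120],
})

N = 2
top_n = (
    toy.sort_values("total_points", ascending=False)
    .groupby(["Season", "position"])
    .head(N)  # first N rows of each group, i.e. the N highest totals
    .sort_values(["Season", "position", "total_points"], ascending=[True, True, False])
    .reset_index(drop=True)
)
```

The final sort only restores a readable order (season, then position, then points descending).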

In [105]:
total_season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()

N = 10  # top N players by position, selected the same way as above
top_n_players_season = (total_season_points.groupby(['Season', 'position']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))

avg_total_points_top_n = top_n_players_season.groupby(['Season', 'position'])['total_points'].mean().reset_index() # average total season points (not per-match points) for the top N players

plt.figure(figsize=(12, 8))
sns.barplot(data=avg_total_points_top_n, x='position', y='total_points', hue='Season')

plt.title(f"Average Total Season Points (Top {N} Players by Position Each Season)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Total Season Points", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
[Figure: Average Total Season Points (Top 10 Players by Position Each Season)]
In [106]:
top_n_players_season
Out[106]:
name position Season total_points
0 Stuart Dallas DEF 2020-2021 171
1 Andrew Robertson DEF 2020-2021 161
2 Trent Alexander-Arnold DEF 2020-2021 160
3 Aaron Cresswell DEF 2020-2021 153
4 Aaron Wan-Bissaka DEF 2020-2021 144
5 Ben Chilwell DEF 2020-2021 139
6 Matt Targett DEF 2020-2021 138
7 Lewis Dunk DEF 2020-2021 130
8 John Stones DEF 2020-2021 128
9 Tyrone Mings DEF 2020-2021 128
10 Harry Kane FWD 2020-2021 242
11 Patrick Bamford FWD 2020-2021 194
12 Jamie Vardy FWD 2020-2021 187
13 Ollie Watkins FWD 2020-2021 168
14 Dominic Calvert-Lewin FWD 2020-2021 165
15 Roberto Firmino FWD 2020-2021 141
16 Chris Wood FWD 2020-2021 138
17 Che Adams FWD 2020-2021 136
18 Callum Wilson FWD 2020-2021 134
19 Danny Ings FWD 2020-2021 131
20 Emiliano Martínez GK 2020-2021 186
21 Ederson Santana de Moraes GK 2020-2021 160
22 Illan Meslier GK 2020-2021 154
23 Hugo Lloris GK 2020-2021 149
24 Nick Pope GK 2020-2021 144
25 Alisson Ramses Becker GK 2020-2021 140
26 Edouard Mendy GK 2020-2021 140
27 Sam Johnstone GK 2020-2021 140
28 Lukasz Fabianski GK 2020-2021 133
29 Bernd Leno GK 2020-2021 131
30 Bruno Fernandes MID 2020-2021 244
31 Mohamed Salah MID 2020-2021 231
32 Heung-Min Son MID 2020-2021 228
33 Sadio Mané MID 2020-2021 176
34 Marcus Rashford MID 2020-2021 174
35 Jack Harrison MID 2020-2021 160
36 Ilkay Gündogan MID 2020-2021 157
37 James Ward-Prowse MID 2020-2021 156
38 Raheem Sterling MID 2020-2021 154
39 Matheus Pereira MID 2020-2021 153
40 Trent Alexander-Arnold DEF 2021-2022 208
41 Andrew Robertson DEF 2021-2022 186
42 Virgil van Dijk DEF 2021-2022 183
43 Joel Matip DEF 2021-2022 170
44 Aymeric Laporte DEF 2021-2022 160
45 Antonio Rüdiger DEF 2021-2022 150
46 Matthew Cash DEF 2021-2022 147
47 Gabriel Magalhães DEF 2021-2022 146
48 Reece James DEF 2021-2022 140
49 Conor Coady DEF 2021-2022 138
50 Harry Kane FWD 2021-2022 192
51 Cristiano Ronaldo dos Santos Aveiro FWD 2021-2022 159
52 Teemu Pukki FWD 2021-2022 142
53 Michail Antonio FWD 2021-2022 140
54 Ivan Toney FWD 2021-2022 139
55 Emmanuel Dennis FWD 2021-2022 134
56 Jamie Vardy FWD 2021-2022 133
57 Ollie Watkins FWD 2021-2022 131
58 Richarlison de Andrade FWD 2021-2022 125
59 Gabriel Fernando de Jesus FWD 2021-2022 119
60 Alisson Ramses Becker GK 2021-2022 176
61 Hugo Lloris GK 2021-2022 158
62 Ederson Santana de Moraes GK 2021-2022 155
63 Lukasz Fabianski GK 2021-2022 136
64 Aaron Ramsdale GK 2021-2022 135
65 David de Gea GK 2021-2022 132
66 Kasper Schmeichel GK 2021-2022 131
67 Edouard Mendy GK 2021-2022 130
68 Nick Pope GK 2021-2022 130
69 Emiliano Martínez GK 2021-2022 129
70 Mohamed Salah MID 2021-2022 265
71 Heung-Min Son MID 2021-2022 258
72 Jarrod Bowen MID 2021-2022 206
73 Kevin De Bruyne MID 2021-2022 196
74 Sadio Mané MID 2021-2022 183
75 James Maddison MID 2021-2022 181
76 Bukayo Saka MID 2021-2022 179
77 Diogo Jota MID 2021-2022 175
78 Mason Mount MID 2021-2022 169
79 Raheem Sterling MID 2021-2022 162
80 Kieran Trippier DEF 2022-2023 198
81 Benjamin White DEF 2022-2023 156
82 Trent Alexander-Arnold DEF 2022-2023 155
83 Gabriel Magalhães DEF 2022-2023 146
84 Ben Mee DEF 2022-2023 143
85 Fabian Schär DEF 2022-2023 139
86 Tyrone Mings DEF 2022-2023 130
87 Dan Burn DEF 2022-2023 129
88 Sven Botman DEF 2022-2023 128
89 Pervis Estupiñán DEF 2022-2023 127
90 Erling Haaland FWD 2022-2023 272
91 Harry Kane FWD 2022-2023 263
92 Ivan Toney FWD 2022-2023 182
93 Ollie Watkins FWD 2022-2023 175
94 Callum Wilson FWD 2022-2023 157
95 Bryan Mbeumo FWD 2022-2023 150
96 Dominic Solanke FWD 2022-2023 130
97 Gabriel Fernando de Jesus FWD 2022-2023 125
98 Brennan Johnson FWD 2022-2023 122
99 Aleksandar Mitrović FWD 2022-2023 107
100 David Raya Martin GK 2022-2023 166
101 Alisson Ramses Becker GK 2022-2023 162
102 David De Gea Quintana GK 2022-2023 161
103 Nick Pope GK 2022-2023 157
104 José Malheiro de Sá GK 2022-2023 148
105 Aaron Ramsdale GK 2022-2023 143
106 Bernd Leno GK 2022-2023 142
107 Emiliano Martínez GK 2022-2023 135
108 Lukasz Fabianski GK 2022-2023 127
109 Jordan Pickford GK 2022-2023 124
110 Mohamed Salah MID 2022-2023 239
111 Martin Ødegaard MID 2022-2023 212
112 Marcus Rashford MID 2022-2023 205
113 Bukayo Saka MID 2022-2023 202
114 Gabriel Martinelli MID 2022-2023 198
115 Kevin De Bruyne MID 2022-2023 183
116 Bruno Fernandes MID 2022-2023 176
117 Eberechi Eze MID 2022-2023 159
118 Pascal Groß MID 2022-2023 159
119 Miguel Almirón Rejala MID 2022-2023 158
120 Benjamin White DEF 2023-2024 181
121 William Saliba DEF 2023-2024 164
122 Gabriel Magalhães DEF 2023-2024 148
123 Pedro Porro DEF 2023-2024 136
124 Jarrad Branthwaite DEF 2023-2024 124
125 Fabian Schär DEF 2023-2024 123
126 Joško Gvardiol DEF 2023-2024 123
127 Kyle Walker DEF 2023-2024 123
128 Trent Alexander-Arnold DEF 2023-2024 122
129 Joachim Andersen DEF 2023-2024 121
130 Ollie Watkins FWD 2023-2024 228
131 Erling Haaland FWD 2023-2024 217
132 Dominic Solanke FWD 2023-2024 175
133 Alexander Isak FWD 2023-2024 172
134 Jean-Philippe Mateta FWD 2023-2024 163
135 Julián Álvarez FWD 2023-2024 157
136 Carlton Morris FWD 2023-2024 146
137 Nicolas Jackson FWD 2023-2024 142
138 Matheus Santos Carneiro Da Cunha FWD 2023-2024 135
139 Darwin Núñez Ribeiro FWD 2023-2024 131
140 Jordan Pickford GK 2023-2024 153
141 David Raya Martin GK 2023-2024 135
142 André Onana GK 2023-2024 133
143 Bernd Leno GK 2023-2024 133
144 Mark Flekken GK 2023-2024 119
145 Alphonse Areola GK 2023-2024 116
146 Emiliano Martínez GK 2023-2024 115
147 Ederson Santana de Moraes GK 2023-2024 112
148 Guglielmo Vicario GK 2023-2024 112
149 Norberto Murara Neto GK 2023-2024 110
150 Cole Palmer MID 2023-2024 244
151 Bukayo Saka MID 2023-2024 226
152 Phil Foden MID 2023-2024 226
153 Heung-Min Son MID 2023-2024 213
154 Mohamed Salah MID 2023-2024 211
155 Martin Ødegaard MID 2023-2024 186
156 Anthony Gordon MID 2023-2024 183
157 Jarrod Bowen MID 2023-2024 182
158 Kai Havertz MID 2023-2024 180
159 Bruno Fernandes MID 2023-2024 166
160 Virgil van Dijk DEF 2024-2025 45
161 Trent Alexander-Arnold DEF 2024-2025 44
162 Joško Gvardiol DEF 2024-2025 42
163 Gabriel Magalhães DEF 2024-2025 41
164 Ibrahima Konaté DEF 2024-2025 39
165 Diogo Dalot Teixeira DEF 2024-2025 37
166 Lucas Digne DEF 2024-2025 35
167 Cristian Romero DEF 2024-2025 32
168 Ola Aina DEF 2024-2025 32
169 Andrew Robertson DEF 2024-2025 31
170 Erling Haaland FWD 2024-2025 75
171 Chris Wood FWD 2024-2025 59
172 Danny Welbeck FWD 2024-2025 57
173 Nicolas Jackson FWD 2024-2025 57
174 Ollie Watkins FWD 2024-2025 51
175 Kai Havertz FWD 2024-2025 44
176 Matheus Santos Carneiro Da Cunha FWD 2024-2025 41
177 Raúl Jiménez FWD 2024-2025 41
178 Yoane Wissa FWD 2024-2025 39
179 Jamie Vardy FWD 2024-2025 38
180 André Onana GK 2024-2025 42
181 Matz Sels GK 2024-2025 42
182 Robert Sánchez GK 2024-2025 39
183 David Raya Martin GK 2024-2025 37
184 Nick Pope GK 2024-2025 36
185 Alisson Ramses Becker GK 2024-2025 35
186 Jordan Pickford GK 2024-2025 33
187 Dean Henderson GK 2024-2025 32
188 Emiliano Martínez GK 2024-2025 30
189 Ederson Santana de Moraes GK 2024-2025 29
190 Mohamed Salah MID 2024-2025 84
191 Cole Palmer MID 2024-2025 79
192 Bryan Mbeumo MID 2024-2025 68
193 Bukayo Saka MID 2024-2025 63
194 Luis Díaz MID 2024-2025 60
195 Dwight McNeil MID 2024-2025 49
196 Noni Madueke MID 2024-2025 46
197 James Maddison MID 2024-2025 45
198 Jarrod Bowen MID 2024-2025 45
199 Emile Smith Rowe MID 2024-2025 41
In [107]:
master_cleaned['value'].describe()
Out[107]:
count    42938.000000
mean         5.421533
std          1.396305
min          3.600000
25%          4.500000
50%          5.000000
75%          5.700000
max         15.400000
Name: value, dtype: float64

The following section looks at key metrics relative to player positions.

In [108]:
master_cleaned_copy["kickoff_time"] = pd.to_datetime(master_cleaned_copy["kickoff_time"])
master_cleaned_copy["season"] = master_cleaned_copy["kickoff_time"].apply(lambda x: f"{x.year}/{x.year + 1}" if x.month >= 8 else f"{x.year - 1}/{x.year}")

# we need to group players into bins
position_bins = {
    "GK": "Goalkeepers",
    "DEF": "Defenders",
    "MID": "Midfielders",
    "FWD": "Forwards",
}

master_cleaned_copy["position_bin"] = master_cleaned_copy["position"].map(position_bins) # now mapping positions to bins

# Grouping by season and position bins ('player_count' counts rows, i.e. player-match appearances)
grouped = master_cleaned_copy.groupby(["season", "position_bin"]).agg({"name": "count", "xP": "mean"}).rename(columns={"name": "player_count", "xP": "avg_xP"}).reset_index()

print(grouped)
       season position_bin  player_count    avg_xP
0   2020/2021    Defenders          3161  2.750854
1   2020/2021     Forwards          1244  3.021624
2   2020/2021  Goalkeepers           709  3.711142
3   2020/2021  Midfielders          4139  2.821889
4   2021/2022    Defenders          3194  3.100423
5   2021/2022     Forwards          1258  3.087043
6   2021/2022  Goalkeepers           724  3.933218
7   2021/2022  Midfielders          4236  3.079072
8   2022/2023    Defenders          3608  2.595926
9   2022/2023     Forwards          1384  2.982117
10  2022/2023  Goalkeepers           769  3.557802
11  2022/2023  Midfielders          5085  2.730610
12  2023/2024    Defenders          3678  2.354323
13  2023/2024     Forwards          1333  3.091860
14  2023/2024  Goalkeepers           772  3.086593
15  2023/2024  Midfielders          5009  2.740397
16  2024/2025    Defenders           856  2.287675
17  2024/2025     Forwards           290  3.080345
18  2024/2025  Goalkeepers           182  3.128846
19  2024/2025  Midfielders          1307  2.450956

Now we will look at how different metrics align with total points for each position.

In [109]:
master_cleaned_copy["was_home"] = master_cleaned_copy["was_home"].apply(lambda x: 1 if x == True else 0)  # binary-encode `was_home` (True -> 1, False -> 0)
 
master_cleaned_copy = master_cleaned_copy.drop(columns=["name", "position", "team"])  # dropping categorical columns
 
target = "total_points"  
threshold = 0.2  # setting the correlation threshold
 
for position, group in master_cleaned_copy.groupby("position_bin"):
    numeric_data = group.select_dtypes(include=["float64", "int64"])
    
    correlation = numeric_data.corr()[target].sort_values(ascending=False)
    
    correlation = correlation.drop(target)
    
    correlation = correlation[correlation.abs() > threshold] # filtering correlations by the threshold
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation.to_frame(), annot=True, cmap="coolwarm", fmt=".2f", cbar=True, yticklabels=correlation.index)
    plt.title(f"Correlations with {target} for {position}")
    
    plt.show()
[Figure: Correlations with total_points for Defenders]
[Figure: Correlations with total_points for Forwards]
[Figure: Correlations with total_points for Goalkeepers]
[Figure: Correlations with total_points for Midfielders]

Interestingly, the variables most strongly correlated with total points differ by position, which makes sense: goalkeepers and defenders rely on clean sheets, and goalkeepers also rely on saves for points. Midfielders and forwards, by contrast, rely on goals and assists, with forwards showing a stronger correlation for goals scored and midfielders for creative playmaking.

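Defensive Player Metrics

In [110]:
# Copy the filtered rows so the new columns below are not assigned to a slice of master_cleaned
gk_def_data = master_cleaned[master_cleaned['position'].isin(['GK', 'DEF'])].copy()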

# Calculate cumulative metrics
gk_def_data['cumulative_clean_sheets'] = gk_def_data.groupby(['name', 'Season'])['clean_sheets'].cumsum()
gk_def_data['cumulative_saves'] = gk_def_data.groupby(['name', 'Season'])['saves'].cumsum()
gk_def_data['cumulative_goals_conceded'] = gk_def_data.groupby(['name', 'Season'])['goals_conceded'].cumsum()
gk_def_data['cumulative_points'] = gk_def_data.groupby(['name', 'Season'])['total_points'].cumsum()

# Aggregate by Gameweek
def_data = gk_def_data[gk_def_data['position'] == 'DEF'].groupby('GW').sum(numeric_only=True)
gk_data = gk_def_data[gk_def_data['position'] == 'GK'].groupby('GW').sum(numeric_only=True)
In [111]:
plt.figure(figsize=(10, 6))

# Primary axis for cumulative points
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(def_data.index, def_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)

# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(def_data.index, def_data['cumulative_clean_sheets'], label='Clean Sheets', color='blue', marker='o', linestyle='--')
ax2.plot(def_data.index, def_data['cumulative_goals_conceded'] / 10, label='Goals Conceded (Divided by 10)', color='brown', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)

plt.title('Defenders: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
<Figure size 1000x600 with 0 Axes>
[Figure: Defenders: Seasonal Total Cumulative Metrics by Gameweek]
In [112]:
plt.figure(figsize=(10, 6))

# Primary axis for cumulative points
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(gk_data.index, gk_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)

# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(gk_data.index, gk_data['cumulative_clean_sheets'], label='Clean Sheets', color='blue', marker='o', linestyle='--')
ax2.plot(gk_data.index, gk_data['cumulative_goals_conceded'] / 10, label='Goals Conceded (Divided by 10)', color='brown', marker='o', linestyle='--')
ax2.plot(gk_data.index, gk_data['cumulative_saves'] / 10, label='Saves (Divided by 10)', color='salmon', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)

plt.title('Goalkeepers: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
<Figure size 1000x600 with 0 Axes>
[Figure: Goalkeepers: Seasonal Total Cumulative Metrics by Gameweek]

These trends compare key defensive metrics for goalkeepers and defenders over the course of a season, emphasizing how clean sheets, goals conceded, saves, and total points evolve by gameweek. For defenders, the cumulative points (green line) show a consistent rise, aligning closely with clean sheets (blue line), while goals conceded (brown line) contribute less significantly to their overall points. For goalkeepers, the cumulative points (green line) also increase steadily but at a lower rate (as expected, since goalkeepers have lower point-earning potential), with a more significant contribution from saves (salmon line) and clean sheets (blue line).
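The dual-axis plots in this category all repeat the same steps, and the extra `plt.figure(...)` call before `plt.subplots(...)` is what produces the empty "<Figure size 1000x600 with 0 Axes>" output. A hedged sketch of a small helper that wraps the pattern; `plot_cumulative_metrics` and its `secondary` argument are hypothetical names, and the data below is made up:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_cumulative_metrics(data, secondary, title):
    """Plot 'cumulative_points' on the left axis and other metrics on a twin axis.

    `secondary` maps a column name to a (label, scale) pair; the column is
    divided by `scale` before plotting. Hypothetical helper, not from the notebook.
    """
    fig, ax1 = plt.subplots(figsize=(10, 6))  # subplots already creates the figure
    ax1.plot(data.index, data["cumulative_points"], label="Points", color="green", marker="^")
    ax1.set_xlabel("Gameweek")
    ax1.set_ylabel("Cumulative Points (Seasonal Total)", color="green")
    ax1.grid(alpha=0.3)

    ax2 = ax1.twinx()
    for column, (label, scale) in secondary.items():
        ax2.plot(data.index, data[column] / scale, label=label, marker="o", linestyle="--")
    ax2.set_ylabel("Other Metrics (Scaled)")

    # Merge the legends of both axes into one box
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines1 + lines2, labels1 + labels2, loc="upper left")
    ax1.set_title(title)
    fig.tight_layout()
    return fig, ax1, ax2


toy = pd.DataFrame(
    {
        "cumulative_points": [10, 25, 45],
        "cumulative_clean_sheets": [1, 3, 4],
        "cumulative_saves": [20, 50, 90],
    },
    index=pd.Index([1, 2, 3], name="GW"),
)
fig, ax1, ax2 = plot_cumulative_metrics(
    toy,
    {
        "cumulative_clean_sheets": ("Clean Sheets", 1),
        "cumulative_saves": ("Saves (Divided by 10)", 10),
    },
    "Goalkeepers: toy example",
)
```

In a notebook the returned figure renders at the end of the cell; passing a different `secondary` mapping would cover the defender, midfielder, and forward variants.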

Offensive Player Metrics (Mids and Fwds)

In [113]:
# Filter the data for midfielders (MID) and forwards (FWD); copy so new columns are not assigned to a slice
mid_fwd_data = master_cleaned[master_cleaned['position'].isin(['MID', 'FWD'])].copy()

# Calculate cumulative metrics
mid_fwd_data['cumulative_goals_scored'] = mid_fwd_data.groupby(['name', 'Season'])['goals_scored'].cumsum()
mid_fwd_data['cumulative_assists'] = mid_fwd_data.groupby(['name', 'Season'])['assists'].cumsum()
mid_fwd_data['cumulative_points'] = mid_fwd_data.groupby(['name', 'Season'])['total_points'].cumsum()

# Aggregate by Gameweek
mid_data = mid_fwd_data[mid_fwd_data['position'] == 'MID'].groupby('GW').sum(numeric_only=True)
fwd_data = mid_fwd_data[mid_fwd_data['position'] == 'FWD'].groupby('GW').sum(numeric_only=True)
In [114]:
plt.figure(figsize=(10, 6))

# Primary axis for cumulative points
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(mid_data.index, mid_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)

# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(mid_data.index, mid_data['cumulative_goals_scored'], label='Goals Scored', color='blue', marker='o', linestyle='--')
ax2.plot(mid_data.index, mid_data['cumulative_assists'], label='Assists', color='orange', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)

plt.title('Midfielders: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
<Figure size 1000x600 with 0 Axes>
[Figure: Midfielders: Seasonal Total Cumulative Metrics by Gameweek]
In [115]:
plt.figure(figsize=(10, 6))

# Primary axis for cumulative points
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(fwd_data.index, fwd_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)

# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(fwd_data.index, fwd_data['cumulative_goals_scored'], label='Goals Scored', color='blue', marker='o', linestyle='--')
ax2.plot(fwd_data.index, fwd_data['cumulative_assists'], label='Assists', color='orange', marker='o', linestyle='--')
# ax2.plot(fwd_data.index, fwd_data['cumulative_threat'] / 100, label='Threat (Divided by 100)', color='red', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)

plt.title('Forwards: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
[Figure: Forwards: Seasonal Total Cumulative Metrics by Gameweek]

These trends compare key offensive metrics for midfielders and forwards over the course of a season, showing how goals scored, assists, and total points evolve by gameweek. For forwards, cumulative points (green line) rise consistently and align closely with goals scored (blue line), while assists (orange line) contribute less to their overall points. For midfielders, cumulative points (green line) also increase steadily but reflect a more balanced contribution from goals scored (blue line) and assists (orange line). This underscores the dual role midfielders play in both scoring and creating opportunities.
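As a minimal sketch of how such running totals can be built, the snippet below uses a small hypothetical table (the player name and values are made up for illustration). It sorts by season and gameweek before calling `cumsum`, so each total restarts per season and follows chronological order.

```python
import pandas as pd

# Hypothetical per-gameweek rows for one player across two seasons (illustrative values only)
toy = pd.DataFrame({
    "name": ["Player A"] * 4,
    "Season": ["2022-2023", "2022-2023", "2023-2024", "2023-2024"],
    "GW": [2, 1, 1, 2],
    "goals_scored": [1, 2, 0, 3],
})

# Sort chronologically first so the running total follows gameweek order
toy = toy.sort_values(["name", "Season", "GW"]).reset_index(drop=True)

# The running total restarts for every (player, season) group
toy["cumulative_goals_scored"] = toy.groupby(["name", "Season"])["goals_scored"].cumsum()
print(toy)
```

Without the sort, `cumsum` would accumulate in whatever order the rows happen to be stored, which may not match gameweek order.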

The following visualizations in this category focus on the correlations between specific performance-related metrics.

In [116]:
# Filter and reorganize the dataset for relevant features
selected_columns = [
    'total_points', 'goals_scored', 'assists', 'expected_goals', 'expected_assists',
    'expected_goal_involvements', 'clean_sheets', 'minutes', 'penalties_missed',
    'influence', 'creativity', 'threat', 'bps'
]
corr_data = master_cleaned[selected_columns]  # named corr_data rather than copy to avoid clashing with Python's copy module

# Compute the correlation matrix
correlation_matrix = corr_data.corr()

# Sort the matrix by correlation with 'total_points'
correlation_matrix = correlation_matrix.sort_values(by='total_points', ascending=False, axis=0)
correlation_matrix = correlation_matrix.sort_values(by='total_points', ascending=False, axis=1)

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', 
    vmin=-1, vmax=1, cbar_kws={"shrink": 0.8}, linewidths=0.5
)
plt.title("Correlation Matrix of Key Player Performance Metrics", fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()
[Figure: Correlation Matrix of Key Player Performance Metrics]

The metric 'total_points' has strong correlations with 'bps' (bonus points system), 'goals_scored', and 'influence'. This indicates these are the primary drivers of a player's overall FPL performance.

Metrics like 'penalties_missed' show little to no correlation with total_points, suggesting they have a minimal impact.

Advanced metrics like 'expected_goal_involvements' and 'expected_goals' show strong relationships with 'goals_scored' and 'total_points', which supports their use as predictive features for future performance.
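The ranking of features by their correlation with the target can also be read off directly, without a heatmap. The following is a sketch on a tiny hypothetical table (the values are invented for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical per-match rows (illustrative values only)
toy = pd.DataFrame({
    "total_points": [2, 6, 9, 1, 12],
    "bps": [10, 25, 35, 5, 48],
    "penalties_missed": [0, 0, 1, 0, 0],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["total_points"]
    .drop("total_points")          # the target correlates perfectly with itself
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Keep in mind that a correlation computed within the same match describes association, not causation, and rare events such as missed penalties produce unstable estimates.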

In [117]:
# Select key metrics for pairwise relationships
pairwise_metrics = ['total_points', 'bps', 'goals_scored', 'expected_goal_involvements']
# No palette argument: without a hue variable it has no effect
sns.pairplot(master_cleaned[pairwise_metrics], kind='reg', diag_kind='kde')
plt.suptitle("Key Pairwise Relationships Between Metrics", y=1.02, fontsize=16, fontweight='bold')
plt.show()
[Figure: Key Pairwise Relationships Between Metrics]

Insights from the pairwise relationships:

  1. Total points and BPS show a positive linear relationship. BPS is a strong indicator of total points as it reflects a player's overall contribution in a match (tackles, passes, etc.).
  2. Total points increase with goals scored, as goals directly contribute to a player's point tally.
  3. The relationship between total points and expected goal involvements is less linear, as not all points come from goals (e.g., clean sheets or assists also contribute).
  4. BPS and expected goal involvements show some positive association, but it is weaker than for the other pairs.
  5. Goals scored and expected goal involvements show a moderate linear correlation: players who score more tend to have higher xG and xA values.

The graph below summarizes the main metrics we have analyzed and their correlation with our target variable, 'total_points'.

Category 2: Advanced Metrics (xG and xA)

  • This section introduces expected goals (xG) and expected assists (xA) as metrics that provide deeper insights by quantifying the quality of chances, helping to differentiate between sustainable performance and statistical anomalies.

  • xG measures the cumulative probability of scoring based on the quality of chances, while xA estimates the likelihood that a pass will lead to a goal.

  • xG and xA quantify the quality of chances created or taken, providing a reliable indicator of a player's underlying performance. They help identify players who are overperforming or underperforming relative to expectations.
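To make the idea concrete, here is a small sketch of how a match-level xG figure can be read as the sum of per-shot scoring probabilities. The shot probabilities and the goal count are hypothetical values chosen for illustration; the dataset itself only provides the aggregated 'expected_goals' column.

```python
# Hypothetical scoring probabilities for one player's four shots in a match
shot_probabilities = [0.76, 0.08, 0.12, 0.04]

# Expected goals for the match: the sum of the individual shot probabilities
match_xg = sum(shot_probabilities)

# Hypothetical actual outcome for the same match
goals_scored = 2

# Positive values mean the player scored more than the chances suggested
overperformance = goals_scored - match_xg
print(f"xG = {match_xg:.2f}, goals = {goals_scored}, difference = {overperformance:+.2f}")
```

A single match says little on its own; the comparison becomes informative when the difference is accumulated over many gameweeks.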

In [118]:
def plot_multiple_players_xg_vs_goals(player_names, season):
    plt.figure(figsize=(12, 8))  # Define figure size
    color_palette = sns.color_palette("tab10", len(player_names))  # Generate distinct colors for each player

    for idx, player_name in enumerate(player_names):
        # Filter data for the player and season
        # (the season mask comes from master_cleaned_copy, so both frames must share the same index)
        player_data = master_cleaned[(master_cleaned['name'].str.contains(player_name, case=False, na=False)) & (master_cleaned_copy['season'] == season)]
        
        player_data = player_data.sort_values(by='GW')  # Sorting by Gameweek
        
        # Assign a color for the player
        player_color = color_palette[idx]
        
        # Plot cumulative goals and xG for the player
        plt.plot(player_data['GW'], player_data['cumulative_goals'], label=f"{player_name} - Goals", color=player_color, linewidth=2)
        plt.plot(player_data['GW'], player_data['cumulative_xG'], label=f"{player_name} - xG", color=player_color, linestyle="--", linewidth=2)

    # Set plot title and labels
    plt.title(f"Goals vs Expected Goals ({season})", fontsize=16, fontweight='bold')
    plt.xlabel("Gameweek", fontsize=12)
    plt.ylabel("Cumulative Count", fontsize=12)

    # Custom legend to group goals and xG by color
    custom_lines = [
        mlines.Line2D([], [], color=color_palette[i], linewidth=2, label=f"{player_names[i]} - Goals") for i in range(len(player_names))
    ] + [
        mlines.Line2D([], [], color=color_palette[i], linestyle="--", linewidth=2, label=f"{player_names[i]} - xG") for i in range(len(player_names))
    ]
    plt.legend(handles=custom_lines, fontsize=10, loc='upper left', bbox_to_anchor=(1, 1))

    # Add grid and adjust layout
    plt.grid(alpha=0.5)
    plt.tight_layout()
    plt.show()


# Example usage
player_names = ["Erling Haaland", "Kai Havertz"]  # List of player names
season = "2022/2023"  # Season
plot_multiple_players_xg_vs_goals(player_names, season)
[Figure: Goals vs Expected Goals (2022/2023)]
  • This chart compares the cumulative goals and expected goals (xG) for Erling Haaland and Kai Havertz across the 2022/2023 season.
  • Erling Haaland significantly exceeds his xG, showing exceptional finishing ability and efficiency, as his goals curve consistently outpaces his xG.
  • Kai Havertz, however, lags behind his xG in the second half of the season, highlighting inefficiencies in converting chances.
  • This analysis shows that xG is a useful baseline for assessing goal output: comparing actual goals against it reveals which players are overperforming and which are underperforming.
In [119]:
def plot_multiple_players_xa_vs_assists(player_names, season):
    plt.figure(figsize=(12, 8))  # Define figure size
    color_palette = sns.color_palette("tab10", len(player_names))  # Generate distinct colors for each player

    for idx, player_name in enumerate(player_names):
        # Filter data for the player and season
        # (the season mask comes from master_cleaned_copy, so both frames must share the same index)
        player_data = master_cleaned[(master_cleaned['name'].str.contains(player_name, case=False, na=False)) & (master_cleaned_copy['season'] == season)]
        
        player_data = player_data.sort_values(by='GW')  # Sorting by Gameweek
        
        # Assign a color for the player
        player_color = color_palette[idx]
        
        # Plot cumulative assists and xA for the player
        plt.plot(player_data['GW'], player_data['cumulative_assists'], label=f"{player_name} - Assists", color=player_color, linewidth=2)
        plt.plot(player_data['GW'], player_data['cumulative_xA'], label=f"{player_name} - xA", color=player_color, linestyle="--", linewidth=2)

    # Set plot title and labels
    plt.title(f"Assists vs Expected Assists ({season})", fontsize=16, fontweight='bold')
    plt.xlabel("Gameweek", fontsize=12)
    plt.ylabel("Cumulative Count", fontsize=12)

    # Custom legend to group assists and xA by color
    custom_lines = [
        mlines.Line2D([], [], color=color_palette[i], linewidth=2, label=f"{player_names[i]} - Assists") for i in range(len(player_names))
    ] + [
        mlines.Line2D([], [], color=color_palette[i], linestyle="--", linewidth=2, label=f"{player_names[i]} - xA") for i in range(len(player_names))
    ]
    plt.legend(handles=custom_lines, fontsize=10, loc='upper left', bbox_to_anchor=(1, 1))

    # Add grid and adjust layout
    plt.grid(alpha=0.5)
    plt.tight_layout()
    plt.show()


# Example usage
player_names = ["Bukayo Saka", "Bruno Fernandes"]  # List of player names
season = "2023/2024"  # Season
plot_multiple_players_xa_vs_assists(player_names, season)
[Figure: Assists vs Expected Assists (2023/2024)]

This chart compares Saka and Fernandes in the 2023/2024 season. Saka's assists exceed his xA, indicating overperformance, whereas Fernandes slightly underperforms relative to his xA.

Understanding over- and underperformance is critical for identifying different types of players. Players like Haaland, who consistently overperform their xG or xA, demonstrate elite finishing or creativity, highlighting their unique ability to convert opportunities beyond statistical expectations. On the other hand, players underperforming these metrics may indicate inefficiency or bad luck, but they could also represent undervalued opportunities if their underlying statistics remain strong and consistent. This insight helps managers differentiate between sustainable excellence and potential rebounds in performance.

In [120]:
# Filter data for the 2024-2025 season
season_2024_2025 = master_cleaned[master_cleaned['Season'] == '2024-2025']

# Aggregate data to calculate total expected goals and actual goals scored
aggregated_data = season_2024_2025.groupby(['name']).agg({
    'expected_goals': 'sum',  # Aggregating total expected goals
    'goals_scored': 'sum'    # Aggregating total goals scored
}).reset_index()

# Filter top performers based on Aggregated Expected Goals or Actual Goals Scored
top_performers = aggregated_data[
    (aggregated_data['expected_goals'] > 4) | 
    (aggregated_data['goals_scored'] > 4)
]

# Plotting the scatter plot without position hue
plt.figure(figsize=(12, 8))
sns.scatterplot(
    x='expected_goals', 
    y='goals_scored', 
    data=top_performers, 
    edgecolor="w", 
    s=100, 
    color='blue'  # Single color for all points
)

# Add a reference line for x = y
max_value = max(top_performers['expected_goals'].max(), 
                top_performers['goals_scored'].max())
plt.plot(
    [0, max_value], 
    [0, max_value], 
    'k--', linewidth=1, label="x = y"
)

# Annotate each player by name
for _, row in top_performers.iterrows():
    plt.text(
        row['expected_goals'], 
        row['goals_scored'], 
        row['name'],
        fontsize=9, 
        alpha=0.9,
        rotation=45
    )

# Customize plot
plt.title("Aggregated Expected Goals vs Actual Goals Scored by Player (2024-2025)", fontsize=14, fontweight='bold')
plt.xlabel("Aggregated Expected Goals", fontsize=12)
plt.ylabel("Aggregated Goals Scored", fontsize=12)
plt.grid(alpha=0.3)
plt.xlim(0, top_performers['expected_goals'].max() + 1)
plt.ylim(0, top_performers['goals_scored'].max() + 1)
plt.tight_layout()
plt.show()
[Figure: Aggregated Expected Goals vs Actual Goals Scored by Player (2024-2025)]

The scatterplot compares Aggregated Expected Goals (xG) to Actual Goals Scored for top-performing players so far this season. The diagonal line represents perfect alignment between xG and goals scored (x = y). Players above the line, such as Erling Haaland, have outperformed their xG, suggesting exceptional finishing ability, a favorable streak, or positive variance. Conversely, players below the line, like Kai Havertz and Brennan Johnson, are generating strong underlying numbers but may have been on the wrong side of variance or unlucky with their finishing. Understanding this relationship is critical for assessing player sustainability. Overperformance may not always be repeatable, while underperforming players with strong xG figures could represent undervalued opportunities likely to deliver better returns over time. This analysis emphasizes the importance of xG in identifying both reliable performers and potential breakout candidates.
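The position of a player relative to the x = y line can also be expressed as a single number, goals minus xG. The sketch below works on a hypothetical two-player table (names and values are invented); the column names 'goals_minus_xg' and 'side_of_line' are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical per-match rows (illustrative values only)
toy = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "expected_goals": [0.5, 0.75, 1.0, 1.5],
    "goals_scored": [1, 2, 0, 1],
})

# Season totals per player
totals = toy.groupby("name")[["expected_goals", "goals_scored"]].sum()

# Signed distance from the x = y reference line
totals["goals_minus_xg"] = totals["goals_scored"] - totals["expected_goals"]


def side_of_line(diff):
    """Label a player as above, below, or on the x = y line."""
    if diff > 0:
        return "above"
    if diff < 0:
        return "below"
    return "on"


totals["side_of_line"] = totals["goals_minus_xg"].apply(side_of_line)
print(totals.sort_values("goals_minus_xg", ascending=False))
```

Sorting by this difference gives a quick shortlist of the strongest overperformers and underperformers to inspect more closely.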

Extra Analysis: BPS and its relation to Player Position

In [121]:
total_season_bonus = (master_cleaned.groupby(['name', 'position', 'Season'])['bonus'].sum().reset_index())

N = 50  
top_n_players_season = (total_season_bonus.groupby(['Season']).apply(lambda group: group.nlargest(N, 'bonus')).reset_index(drop=True))

bonus_points_by_position = (top_n_players_season.groupby(['Season', 'position'])['bonus'].mean().reset_index())  # average bonus points by position and season for the top N players (note: the choice of N changes the output)

plt.figure(figsize=(12, 8))
sns.barplot(data=bonus_points_by_position,x='position',y='bonus',hue='Season')

plt.title(f"Average Bonus Points by Position and Season (Top {N} Players)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Bonus Points", fontsize=12)
plt.legend(title="Season", loc='upper left')

plt.show()
[Figure: Average Bonus Points by Position and Season (Top 50 Players)]
  • This chart shows the average bonus points of the top 50 players in each season, broken down by FPL position.
  • Midfielders (MID) and forwards (FWD) clearly dominate bonus points across seasons, reflecting their balanced contribution to goals, assists, and defensive actions, which makes them a central part of any fantasy team.
In [122]:
master_cleaned_copy.columns
Out[122]:
Index(['xP', 'assists', 'bonus', 'bps', 'clean_sheets', 'creativity',
       'element', 'fixture', 'goals_conceded', 'goals_scored',
       'Influence_Creativity_Threat_Index', 'influence', 'kickoff_time',
       'minutes', 'opponent_team', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW', 'expected_goals', 'expected_assists',
       'expected_goal_involvements', 'position_DEF', 'position_FWD',
       'position_GK', 'position_MID', 'team_label', 'Hour', 'DayOfWeek',
       'Weekend', 'WeekOfYear', 'Month', 'Year', 'Season', 'gpm', 'apm',
       'cumulative_goals', 'cumulative_assists', 'cumulative_xG',
       'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm',
       'cumulative_xP', 'cumulative_points', 'cumulative_minutes', 'season',
       'position_bin'],
      dtype='object')

Category 3: Miscellaneous graphs (important metrics which affect the total points of the players)

Home and Away Matches

Examining home versus away performance and key match metrics reveals the contextual factors influencing player output. This analysis provides insights into how match location and key game events contribute to total points.

In [123]:
# Home vs Away performance
sns.boxplot(x='was_home', y='total_points', data=master_cleaned, palette=['red', 'blue'])
plt.title("Home vs Away Performance (Total Points)")
plt.xlabel("Was Home")
plt.ylabel("Total Points")
plt.show()
[Figure: Home vs Away Performance (Total Points)]

Home advantage is evident from the higher average and median points at home. Such insights can influence fantasy team captain choices for different fixtures. For example, for a Liverpool home match, managers could favor Liverpool players over players from other Premier League teams.

Below is a trend looking at the home and away matches from another angle. This looks at whether specific players perform better at home vs away matches.

In [124]:
# Calculate Home Points and Away Points based on 'was_home'
home_points_2 = master_cleaned[master_cleaned['was_home'] == True].groupby(['name', 'position']).agg({
    'total_points': 'mean'
}).rename(columns={'total_points': 'Home Points'})

away_points_2 = master_cleaned[master_cleaned['was_home'] == False].groupby(['name', 'position']).agg({
    'total_points': 'mean'
}).rename(columns={'total_points': 'Away Points'})

# Merge Home and Away Points
player_comparison_filtered = home_points_2.merge(away_points_2, on=['name', 'position'], how='outer').reset_index()

# Calculate Total Points (sum of average Home and Away Points)
player_comparison_filtered['Total Points'] = (
    player_comparison_filtered['Home Points'].fillna(0) + 
    player_comparison_filtered['Away Points'].fillna(0)
)

# Sort by Total Points and keep the top 30 players
top30_players = player_comparison_filtered.sort_values(by='Total Points', ascending=False).head(30)

# Plotting the scatter plot with hue based on player position
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x='Away Points', 
    y='Home Points', 
    hue='position', 
    data=top30_players, 
    edgecolor="w", 
    s=100,
    palette="Set2"
)

# Add a reference line for x = y
max_value = top30_players[['Away Points', 'Home Points']].max().max()
plt.plot(
    [0, max_value], 
    [0, max_value], 
    'k--', linewidth=1, label="x = y"
)

# Annotate players' names
for _, row in top30_players.iterrows():
    plt.text(
        row['Away Points'], 
        row['Home Points'], 
        row['name'], 
        fontsize=8, 
        alpha=0.8,
        rotation=45
    )

# Add plot enhancements
plt.title("Top 30 Player Performances: Home vs Away", fontsize=14, fontweight='bold')
plt.xlabel("Average Away Points", fontsize=12)
plt.ylabel("Average Home Points", fontsize=12)
plt.grid(alpha=0.3)
plt.xlim(0, top30_players['Away Points'].max() + 0.05)
plt.ylim(0, top30_players['Home Points'].max() + 0.05)
plt.legend(title="Position", fontsize=10)
plt.show()
[Figure: Top 30 Player Performances: Home vs Away]

The majority of the data points cluster close to the diagonal line x = y, indicating that for most players, their performance at home and away is relatively similar. However, there are noticeable variations where some players perform significantly better either at home (above the diagonal) or away (below the diagonal).

  1. Midfielders (Orange): Some exhibit standout performances at home, reflected in their higher values on the y-axis.
  2. Forwards (Green): Many forwards are near or slightly above the x = y line, suggesting their performance might be more consistent but with a slight home advantage.

Gareth Bale shows exceptionally strong home performance relative to his away stats. Players closer to the diagonal line (e.g., Harry Kane) demonstrate balanced performance across home and away matches. Managers could leverage this data to select players for specific matches. For instance, away matches might favor players like Hourihane, while home matches could benefit from players like Bale.

This metric can help managers decide which players are better suited for home vs away games.
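A compact way to quantify this is a per-player home advantage, defined here as average home points minus average away points. This is a sketch on hypothetical data; the 'venue' and 'Home Advantage' columns are names introduced for illustration.

```python
import pandas as pd

# Hypothetical per-match rows for two players (illustrative values only)
toy = pd.DataFrame({
    "name": ["Player A"] * 4 + ["Player B"] * 4,
    "was_home": [True, True, False, False] * 2,
    "total_points": [8, 6, 2, 4, 3, 5, 5, 3],
})

# Turn the boolean flag into readable column labels before pivoting
toy["venue"] = toy["was_home"].map({True: "Home Points", False: "Away Points"})

# One row per player, one column per venue, holding the mean points
home_away = toy.pivot_table(index="name", columns="venue", values="total_points", aggfunc="mean")

# Positive values sit above the x = y line, negative values below it
home_away["Home Advantage"] = home_away["Home Points"] - home_away["Away Points"]
print(home_away)
```

Players with few matches at one venue will have noisy averages, so a minimum number of appearances is worth enforcing before drawing conclusions.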

In [125]:
# Calculate total goals, assists, and clean sheets for home and away games
metrics_home_away = master_cleaned.groupby('was_home')[['goals_scored', 'assists', 'clean_sheets']].sum().reset_index()

# Bar plot for total metrics
metrics_home_away_melted = metrics_home_away.melt(id_vars='was_home', var_name='Metric', value_name='Count')

# Explicit labeling with custom legend
plt.figure(figsize=(12, 6))
sns.barplot(
    x='Metric', 
    y='Count', 
    hue='was_home', 
    data=metrics_home_away_melted, 
    palette=['red', 'blue']
)

# Add plot enhancements
plt.title("Total Goals, Assists, and Clean Sheets (Home vs Away)", fontsize=14, fontweight='bold')
plt.ylabel("Count", fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: Total Goals, Assists, and Clean Sheets (Home vs Away)]
  • Home matches outperform away matches in all three metrics: goals scored, assists, and clean sheets.
  • The largest difference is observed in clean sheets, suggesting stronger defensive performances at home.

Penalty and Red Card Impact

The following plots examine penalties missed and red cards to assess their negative impact on player scores and overall performance.

In [126]:
# Analyze the impact of penalties missed
sns.boxplot(x='penalties_missed', y='total_points', data=master_cleaned, palette='Set2')
plt.title("Impact of Penalties Missed on Total Points")
plt.xlabel("Penalties Missed")
plt.ylabel("Total Points")
plt.show()
[Figure: Impact of Penalties Missed on Total Points]

Players who missed a penalty (1 on the x-axis) generally show a lower distribution of 'total_points' than players who did not (0 on the x-axis). The effect may be dampened because penalty takers are usually players who otherwise perform well.

In [127]:
# Red cards impact
sns.boxplot(x='red_cards', y='total_points', data=master_cleaned, palette='Set1')
plt.title("Impact of Red Cards on Total Points")
plt.xlabel("Red Cards")
plt.ylabel("Total Points")
plt.show()
[Figure: Impact of Red Cards on Total Points]

Players receiving a red card (1 on the x-axis) have a significantly lower median 'total_points' than those without red cards (0 on the x-axis). Compared with penalties missed, red cards appear to have a more consistent and severe impact on fantasy scores.

Total Points vs Minutes Played

In [128]:
master_cleaned = pd.read_csv("../master_cleaned.csv")

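# Define bins and labels; the last edge is 91 because right=False excludes the upper edge,
# and full 90-minute appearances should fall in the '75-90' bin
bins = [5, 15, 30, 45, 60, 75, 91]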
labels = ['5-15', '15-30', '30-45', '45-60', '60-75', '75-90']

# Create a new column for binned minutes
master_cleaned['minute_bins'] = pd.cut(master_cleaned['minutes'], bins=bins, labels=labels, right=False)

# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='minute_bins', y='total_points', data=master_cleaned, palette='Blues', cut=0)

# Customize the plot
plt.title('Total Points Distribution by Minute Bins', fontsize=16)
plt.xlabel('Minute Bins', fontsize=14)
plt.ylabel('Total Points', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()
[Figure: Total Points Distribution by Minute Bins]

Each violin represents the distribution of total_points for players who played within specific minute_bins. The wider sections of the violins indicate where the density of total_points is higher. The 75-90 bin shows a broader distribution compared to 5-15, meaning players playing 75-90 minutes tend to have a wider range of total points.

The plot reveals that players who play more minutes generally score higher total points, although they can also score fewer points than players in lower minute bins. The violins widen and shift higher on the y-axis for bins like 60-75 and 75-90.

The vertical extent of each violin shows the full range of observed values, including outliers. For instance, in the 75-90 bin, there are instances of extremely low or high total points, reflecting variability in player performance even with significant playing time.
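Because the bins are half-open, edge handling decides which violin a boundary value lands in. The sketch below, using a handful of hypothetical minute values, shows how `pd.cut` with `right=False` assigns labels and how a value equal to the final edge ends up unbinned unless the last edge is extended.

```python
import pandas as pd

# Hypothetical minute values, including the bin boundaries
minutes = pd.Series([5, 14, 15, 74, 75, 89, 90])
labels = ['5-15', '15-30', '30-45', '45-60', '60-75', '75-90']

# right=False builds left-closed intervals: [5, 15), [15, 30), ..., [75, 90)
bins = [5, 15, 30, 45, 60, 75, 90]
left_closed = pd.cut(minutes, bins=bins, labels=labels, right=False)

# Extending the last edge to 91 gives [75, 91), which includes 90
extended = pd.cut(minutes, bins=bins[:-1] + [91], labels=labels, right=False)

print(pd.DataFrame({"minutes": minutes, "edge_90": left_closed, "edge_91": extended}))
```

With the final edge at 90, a full 90-minute appearance receives no label (NaN); with the edge at 91 it is placed in '75-90'.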

Influence Creativity Threat Index

In [129]:
# Influence-Creativity-Threat Index vs Total Points
sns.scatterplot(x='Influence_Creativity_Threat_Index', y='total_points', hue='position', data=master_cleaned, palette='bright', alpha=0.7)
plt.title("Influence-Creativity-Threat Index vs Total Points")
plt.xlabel("Influence-Creativity-Threat Index")
plt.ylabel("Total Points")
plt.legend(title="Position")
plt.show()
[Figure: Influence-Creativity-Threat Index vs Total Points]

There appears to be a positive relationship between 'Influence_Creativity_Threat_Index' and 'total_points'. Forwards dominate high threat and total points due to their primary scoring roles. However, midfielders with balanced indices contribute equally, underlining their versatility.

In [130]:
# Key event metrics across gameweeks
gameweek_metrics = master_cleaned.groupby('GW')[['goals_scored', 'assists', 'clean_sheets']].sum()
gameweek_metrics.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title("Key Metrics Across Gameweeks")
plt.xlabel("Gameweek")
plt.ylabel("Count")
plt.legend(title="Metrics")
plt.show()
[Figure: Key Metrics Across Gameweeks]

Clean sheets outnumber goals and assists, meaning that defenders and goalkeepers with a good clean sheet record are valuable. Note that a clean sheet is recorded for every qualifying player on a team that concedes no goals, whereas each goal or assist is credited to a single player, which inflates the clean sheet count relative to the other two. Goals and assists are rarer events in a football match, but this is offset by the higher points they carry when they occur. Fantasy football managers often use gameweek trends to plan their transfers and team strategies.

Category 4: Player Value

In [131]:
price_bins = [3.5, 4.9, 5.5, 6.0, 7.9, 15.5]  # price bin edges (in millions)
price_labels = ["4.0-4.9", "5.0-5.5", "5.6-6.0", "6.1-7.9", "8.0+"]
# right=True makes each bin include its upper edge, matching the labels (e.g. 4.9 falls in "4.0-4.9")
master_cleaned['price_range'] = pd.cut(master_cleaned['value'], bins=price_bins, labels=price_labels, right=True)

# Player prices change throughout the season, so we use the price in the first GW of the season
start_of_season_prices = (master_cleaned.sort_values(['Season', 'GW']).groupby(['name', 'position', 'Season'])['price_range'].first().reset_index())

total_season_points = (master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()) #calculating total season points for each player

total_season_points = total_season_points.merge(start_of_season_prices, on=['name', 'position', 'Season'], how='left') # merging start-of-season price range into the total_season_points DataFrame

N = 50  # number of top players kept per season
top_n_players_season = (total_season_points.groupby(['Season']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))

avg_points_by_price_position = (top_n_players_season.groupby(['price_range', 'position'])['total_points'].mean().reset_index()) # calculating avg total season points for each price bin and position across seasons

plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_by_price_position,x='price_range',y='total_points',hue='position')

plt.title(f"Average Total Season Points by Price Range and Position (Top {N} Players, All Seasons)", fontsize=16)
plt.xlabel("Price Range", fontsize=12)
plt.ylabel("Average Total Season Points", fontsize=12)
plt.legend(title="Position", loc='upper left')
plt.show()
[Figure: Average Total Season Points by Price Range and Position (Top 50 Players, All Seasons)]

The bar chart displays the average total season points for the top 50 players, grouped by price range and position, aggregated across all seasons. Midfielders and forwards are not represented in the £4–4.5 million price range because players in these positions are rarely priced this low. When they are, they usually do not feature as regular starters, which is why their performance data does not appear in this bin. Similarly, GKs and DEFs are never priced in the premium £8 million+ range, which is typically dominated by MIDs and FWDs. In the 5.6-6.0 million bracket, defenders (high-priced for their position) typically outperform mid-priced midfielders and forwards. Premium midfielders and forwards score significantly more points, reflecting their premium cost and contribution.

This analysis can help managers with bargain hunting. For example, a manager with a 5.6-6.0 million budget would be better served choosing a defender. With more than 6 million available, however, forwards and midfielders outperform defenders, so an offensive purchase would be the better choice.
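One way to put a number on bargain hunting is points per million, i.e. season points divided by price. This ratio is not computed elsewhere in this notebook; it is introduced here as an illustrative assumption, and the players, prices, and point totals below are hypothetical.

```python
import pandas as pd

# Hypothetical players with start-of-season prices (in millions) and season totals
toy = pd.DataFrame({
    "name": ["Defender A", "Midfielder B", "Forward C"],
    "position": ["DEF", "MID", "FWD"],
    "value": [5.8, 5.9, 8.5],
    "total_points": [150, 120, 200],
})

# Points returned per million spent (an assumed value metric, not part of the dataset)
toy["points_per_million"] = toy["total_points"] / toy["value"]

# Rank players from best to worst value
ranked = toy.sort_values("points_per_million", ascending=False).reset_index(drop=True)
best_value = ranked.loc[0, "name"]
print(ranked[["name", "position", "value", "total_points", "points_per_million"]])
```

In this toy example the defender offers the best return per million even though the forward scores the most points overall, which mirrors the trade-off described above.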

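Specific Player Analysis

Top players by total points

In [132]:
# Take the 80 player-gameweek rows with the highest cumulative points;
# barplot then averages the selected rows for each player name
top_players = master_cleaned.nlargest(80, 'cumulative_points')
sns.barplot(x='cumulative_points', y='name', data=top_players)
plt.title("Top 10 Players by Total Points")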
plt.xlabel("Total Points")
plt.ylabel("Player Name")
plt.show()
[Figure: Top 10 Players by Total Points]

This chart highlights the top performers in terms of total points, with Erling Haaland, Mohamed Salah, and Harry Kane leading the list. These players are likely to have consistent performance across matches and recurring impressive performances. The ranking provides insight for team selection, especially for fantasy leagues, by identifying players who contribute the most points.

Below we pick one player (Cole Palmer, position = MID) and analyze his performance for FPL insights.

In [133]:
cole_palmer_data = master_cleaned[(master_cleaned['name'] == "Cole Palmer") & (master_cleaned['Season'] == "2023-2024")].copy()

# Sort by Gameweek to ensure proper ordering
cole_palmer_data = cole_palmer_data.sort_values(by='GW')

# Plot cumulative points on the primary y-axis and key metrics on the secondary y-axis
fig, ax1 = plt.subplots(figsize=(10, 7))

# Primary y-axis for cumulative points
line1 = ax1.plot(cole_palmer_data['GW'], cole_palmer_data['cumulative_points'], label="Cumulative Points", color='blue', marker='o', linestyle='-', linewidth=2)
ax1.set_xlabel("Gameweek", fontsize=12)
ax1.set_ylabel("Cumulative Points", fontsize=12, color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Secondary y-axis for key metrics (BPS, Threat, Influence)
ax2 = ax1.twinx()
line2 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['bps'].cumsum(), label="BPS", color='green', marker='o', linestyle='--', linewidth=1.5)
line3 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['threat'].cumsum(), label="Threat", color='orange', marker='o', linestyle='--', linewidth=1.5)
line4 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['influence'].cumsum(), label="Influence", color='purple', marker='o', linestyle='--', linewidth=1.5)
ax2.set_ylabel("Metric Values", fontsize=12, color='black')
ax2.tick_params(axis='y', labelcolor='black')

# Combine legends
lines = line1 + line2 + line3 + line4 
labels = [l.get_label() for l in lines]
ax1.legend(lines, labels, loc="upper left", fontsize=10)

# Add title and grid
plt.title("Cole Palmer 23/24: Cumulative Points vs Key Metrics by Gameweek", fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: Cole Palmer 23/24: Cumulative Points vs Key Metrics by Gameweek]
  • This visualization ties Palmer’s weekly performance metrics to his cumulative contributions.
  • The metrics of Threat, Influence, and Bonus Points System (BPS) are critical drivers of total points in FPL, as they directly reflect a player's attacking potential, overall impact on matches, and consistency in earning bonus points.
  • Steep rises in cumulative BPS and Threat (e.g., around Gameweek 33) coincide with major increases in cumulative points.
  • The cumulative assessment effectively captures performance over time, highlighting both consistency and standout moments across gameweeks, offering a comprehensive view of a player's contribution.
In [134]:
# Define a function to min-max normalize a metric across all players to the range [0, 1]
def normalize(series):
    value_range = series.max() - series.min()
    # A constant column has no spread; map it to 0 instead of dividing by zero
    if value_range == 0:
        return series * 0.0
    return (series - series.min()) / value_range

# Select the players for comparison
players = ['Cole Palmer', 'Bukayo Saka']
metrics = ['total_points', 'bps', 'threat', 'influence', 'expected_goals', 'expected_assists', 'goals_scored', 'assists', 'value']

# Normalize the metrics across all players first, on a copy so master_cleaned keeps its raw values
master_normalized = master_cleaned.copy()
master_normalized[metrics] = master_normalized[metrics].apply(normalize)

# Filter the data for the selected players and calculate their mean metrics
player_data = master_normalized[master_normalized['name'].isin(players)].groupby('name')[metrics].mean()

# Create radar chart
categories = metrics
num_vars = len(categories)

# Compute angle for each metric
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]

# Plot data for each player
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))

for player in player_data.index:
    values = player_data.loc[player].tolist()
    values += values[:1]  # Close the radar chart
    # ax.fill(angles, values, alpha=0.25, label=player)
    ax.plot(angles, values, linewidth=2, label=player)

# Add descriptors
ax.set_yticks([])
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=10)

plt.title("Player Performance Metrics: Cole Palmer vs Bukayo Saka", fontsize=12, fontweight='bold', pad=20)
plt.legend(bbox_to_anchor=(1.2, 1.1), fontsize=10)
plt.show()

Here we compare key performance metrics for two elite midfielders. Because the budget limits how many elite players a manager can afford, tough selection decisions should rest on a range of metrics rather than on any single one.

The radar plot shows each player's mean per-gameweek value for every metric, min-max normalized across all players. Palmer outperforms Saka in nearly every metric, including our label total_points. Despite this, Saka is valued significantly higher.

This shows that comparing players across several metrics can help managers choose better players and guard against bias from price and reputation.
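One way to make the price comparison concrete is a points-per-million figure. The sketch below is illustrative only: the players and numbers are invented, and it assumes `value` is stored in tenths of a million pounds (so 55 means £5.5m), which should be checked against the dataset before use.

```python
import pandas as pd

# Hypothetical per-gameweek rows for two players (values are invented)
toy = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "total_points": [8, 6, 9, 7],
    "value": [55, 56, 90, 90],  # assumed unit: tenths of a million pounds
})

summary = toy.groupby("name").agg(
    points=("total_points", "sum"),
    mean_value=("value", "mean"),
)
summary["price_m"] = summary["mean_value"] / 10           # convert to millions
summary["points_per_million"] = summary["points"] / summary["price_m"]

best_value = summary["points_per_million"].idxmax()
print(summary)
print("Best value for money:", best_value)
```

Here Player B scores more points in total, but Player A returns more points per million spent, which is the kind of gap a price-only comparison would hide.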

Correlation Outcomes

In [135]:
columns_of_interest = [
    'value', 'expected_goals', 'expected_assists', 'minutes',
    'clean_sheets', 'saves', 'bps', 'was_home',
    'creativity', 'influence', 'threat'
]

# Calculate correlations with `total_points`
correlation_dict = {col: master_cleaned[col].corr(master_cleaned['total_points']) for col in columns_of_interest}

# Convert to a pandas Series for sorting
correlation_series = pd.Series(correlation_dict).sort_values()

# Plot the ascending bar chart
plt.figure(figsize=(12, 8))

# Plot horizontal bars, ordered from lowest to highest correlation
bars = plt.barh(correlation_series.index, correlation_series.values, edgecolor='black')

plt.title('Correlation between Player Metrics and Total Points', fontsize=16, fontweight='bold')
plt.xlabel('Correlation Coefficient', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

This graph shows the Pearson correlation coefficients between various player metrics and total points in FPL. The Bonus Points System (BPS) and influence have the highest positive correlation with total points, indicating that they are the metrics most strongly associated with player performance. This is partly expected, since BPS is itself computed from match events such as goals, assists and clean sheets. Metrics such as clean sheets, expected goals, threat and minutes also show strong positive correlations, reflecting their importance in contributing to overall player scores. Metrics like saves and was_home have weaker correlations, suggesting their impact on total points is more situational or specific to certain player types, such as goalkeepers.
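To illustrate why a metric like saves can look weak overall yet matter for one position, the sketch below compares a pooled correlation with per-position correlations. The data are invented, and the position labels `GK` and `FWD` are assumptions about how positions are encoded.

```python
import pandas as pd

# Hypothetical rows: goalkeepers make saves, forwards never do (values are invented)
toy = pd.DataFrame({
    "position": ["GK"] * 4 + ["FWD"] * 4,
    "saves": [1, 3, 5, 7, 0, 0, 0, 0],
    "total_points": [2, 3, 5, 6, 2, 9, 1, 6],
})

def corr_with_points(group, feature="saves"):
    # Pearson correlation is undefined when the feature is constant within a group
    if group[feature].nunique() < 2:
        return float("nan")
    return group[feature].corr(group["total_points"])

pooled = toy["saves"].corr(toy["total_points"])
by_position = {pos: corr_with_points(g) for pos, g in toy.groupby("position")}

print("Pooled correlation:", round(pooled, 2))
print("Per-position correlation:", by_position)
```

In this toy data the pooled coefficient is diluted by forwards, who record no saves, while the goalkeeper-only coefficient is high. Running the same split on `master_cleaned` would show whether that holds in the real data.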

Key Takeaways

  1. The key takeaway from all the graphs is that player performance in FPL is multifaceted.
  2. A combination of metrics such as BPS (Bonus Points System), Influence, Threat, Expected Goals (xG), Goals Scored, and Clean Sheets provides a more comprehensive understanding than any single metric.
  3. Position analysis: Midfielders and Forwards are consistently among the highest-scoring positions, and metrics related to goals and assists appear to describe and predict their points best. For Defenders and Goalkeepers, performance correlates better with clean sheets and, for goalkeepers, saves.
  4. Home advantage is evident in the gameweek trend, with higher total points scored in home games. Managers can use this by favoring players with home fixtures.
  5. Advanced metrics like xG and xA capture underlying performance: they measure whether a player is consistently generating quality chances, regardless of what outcome-based metrics such as goals scored or assists show.
  6. Value for money: A player's value can be a misleading metric on its own. Considered alongside other metrics, it helps managers make better judgments and optimize their squad within the budget constraints.
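The home-advantage point (takeaway 4) can be checked with a simple group comparison. The sketch below uses invented rows and the dataset's `was_home`, `minutes` and `total_points` column names. Dropping rows with zero minutes is an assumption made here so that unused substitutes do not drag both averages toward zero.

```python
import pandas as pd

# Hypothetical player-fixture rows (values are invented)
toy = pd.DataFrame({
    "was_home": [True, False, True, False, True, False],
    "minutes": [90, 90, 0, 90, 60, 0],
    "total_points": [6, 2, 0, 3, 5, 0],
})

# Keep only appearances where the player actually played
played = toy[toy["minutes"] > 0]

# Label each row as home or away, then average points per venue
venue = played["was_home"].map({True: "home", False: "away"})
mean_points = played.groupby(venue)["total_points"].mean()
home_edge = mean_points["home"] - mean_points["away"]

print(mean_points)
print("Average home edge in points per appearance:", home_edge)
```

A positive `home_edge` on the real data would support the home-advantage observation, though its size should be compared against the spread of points before drawing conclusions.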